1 Preface

Welcome to Reproducible Medical Research with R (RMRWR). I hope that this book meets your needs.

1.1 Who This Book is For

This is a book for anyone in the medical field interested in analyzing the data available to them to better understand health, disease, or delivery of care. This could include nurses, dieticians, psychologists, and PhDs in related fields, as well as medical students, residents, fellows, or doctors in practice.
I expect that most learners will be using this book in their spare time at night and on weekends, as the medical school curriculum is already packed full of information, and there is no room to add skills in reproducible research to the standard curriculum. This book is designed for self-teaching, and many hints and solutions will be provided to avoid roadblocks and frustration. Many learners find themselves wanting to develop reproducible research skills after they have finished their training, and after they have become comfortable with their clinical role. This is the time when they identify and want to address problems faced by patients in their practice with the data they have before them. This book is for you.

1.2 Prerequisites

Thank you for giving this e-book a try. This is designed for physicians or others analyzing health data who are interested in pursuing this field using the R computer language.
We will assume that:

  • You have access to a computer
  • You have access to the internet
  • You can download and install software from the internet to your computer

How to download and install R and RStudio will be addressed, step by step, in Chapter 2.

1.3 The Spiral of Success Structure

This book is structured on the concept of a “spiral of success”, with readers learning about topics like data visualization, data wrangling, data modeling, reproducible research, and communication of results in repeated passes. These will initially be at a superficial level, and at each pass of the spiral, will provide increasing depth and complexity. This means that the chapters on data wrangling will not all be together, nor the chapters on data visualization. Our goal is to build skills gradually, and return to (and remind students of) their previously built skills in one area and to add to them. The eventual goal is for learners to be able to produce, document, and communicate reproducible research to their community.

1.4 Motivation for this Book

Most medical people who learn R to do their own data analysis do it on their own time. They rarely have time for a semester-long course, and their clinical schedules usually will not allow it. Fortunately, a lot of people learn R on their own, and there is a strong and supportive R Community to help new learners. A 2019 Twitter survey conducted by @RLadies found that more than half of respondents were largely self-taught, from books and online resources.

There are a lot of good resources for learning R, so why one more? In part, because the needs of a medical audience are often different. There are distinct needs for protecting health information, generating a descriptive Table One, using secure data tools like REDCap, and creating standard medical journal and meeting output in Word, PowerPoint, and poster formats.

More and more, all science is becoming data science. We are able to track patients, their test results, and even the individual pixels (voxels) of their CT scans electronically, and use those data points to develop new knowledge. While one could argue that health care workers should collect data and bring it to trained statisticians, this does not work nearly as well as you might expect. Most academic statisticians are incentivized to develop new statistical methods, and are neither very interested in, nor incentivized to do, the hand-holding required to wrangle messy clinical data into a manuscript.

There also are simply not enough statisticians to meet the needs of medical science. Having clinicians on the front lines with some data science training makes a big difference, whether in 1854 in London (John Snow) or in 2014 in Flint, Michigan (Mona Hanna-Attisha). Having more clinicians with some training will impact medical care, as they will identify local problems that would otherwise never have reached a statistician, and probably never have been addressed with data.

1.5 The Scientific Reproducibility Crisis

Beginning as far back as 1989, with the David Baltimore case, and increasingly and publicly through the 2010s, there has been a rising tide of realization that a lot of taxpayer-funded science is done sloppily, and that our standards as scientists need to be higher. The line between carelessly-done science and outright fraud is a thin one, and the case can be made that doing science in a sloppy fashion defrauds the funders, as it leads to results that cannot be reproduced by the authors or replicated by others. Particularly in medicine, where incorrect findings can cause great harm, we should take special care to do scientific research which is well-documented, reproducible, and replicable. This topic as a motivating force for doing careful medical research will be expanded upon in Chapter 1.

1.6 Features of a Bookdown electronic book

1.6.1 Icons

There are several icons at the top left, to the right of the clickable RMRWR link, that can be helpful:

  1. The Table of Contents Sidebar - Click on the ‘hamburger’ menu icon (three horizontal lines) or press the s key to toggle the sidebar (table of contents) on and off. Within the sidebar, you can click on whichever chapter or subsection you want.
  2. This book is Searchable - Click on the magnifying glass or press the f key to toggle the Find box and search for whatever you need to find.
  3. You can change the font size, font, and background by clicking on the A icon.
  4. You can download the chapter with the download icon (downward arrow into a file tray) in PDF or EPUB formats.

1.6.2 Sharing

At the top right, there are several icons for sharing links to the current chapter through social media.

1.6.3 Scrolling/Paging

  1. You can scroll up and down within a chapter with your mouse, or use the up and down arrow keys.
  2. You can page through chapters with the left and right arrow keys.

1.7 What this Book is Not

1.7.1 This Book is Not A Statistics Text

This is not an introduction to statistics. I am assuming that you have learned some statistics somewhere in secondary school, undergraduate studies, graduate school, or even medical school. There are lots of statisticians with Ph.D.s who can certainly teach statistics much more effectively than I can. While I have a master’s degree in Clinical Research Design and Statistical Analysis (isn’t that a mouthful!) from the University of Michigan, I will leave formal teaching of statistics to the pros.
If you need to brush up on your statistics, no worries. There are several excellent (and free!) e-books on that very topic, using R. Some good examples include (go ahead and click through the blue links to explore):

  1. Learning Statistics with R (LSR)
  2. Open Intro Statistics
  3. Modern Dive
  4. Teacup Giraffes

We will cover much of the same material as these books, but with a less theoretical and more applied approach. I will focus on specific medical examples, and emphasize issues (like Protected Health Information) that are particularly important for medical data. I am assuming that you are here because you want to analyze your own data in your (probably) very limited free time.

1.7.2 This Book Does Not Provide Comprehensive Coverage of the R Universe

This book is also far from comprehensive in teaching what is available in the R ecosystem. This book should be considered a launch pad. Many of the later chapters will give you a taste of what is available in certain areas, and guide you to resources (and links) that you can explore to learn more and do more beyond the scope of this book. The R computer language has expanded far beyond statistics, and allows you to do many powerful things to improve your workflow, make amazing graphics, and share results with others.

1.8 Some Guideposts

Keep an eye out for helpful Guideposts, which look like this:

Warnings

This is a common syntax error, especially for beginners. Watch out for this.

Tips

This is a helpful tip for debugging.

Try It Out

Take what you have learned and try it yourself in the code box below.

Challenge - take the next step and try a more challenging example.

Try this more complicated example.

Explore More - resources for learning more about a particular topic.

If you want to learn more about Shiny apps, go to https://mastering-shiny.org to see an entire book on the topic.

1.9 Helpful Tools

Throughout this book you will find code examples and demonstrations, and interactive exercises in which you can practice writing R code right in the book. Let’s explain how to use these demonstration flipbooks and learnr exercises.

1.9.1 Demonstrations in Flipbooks

Flipbooks are windows in this book in which you can watch R code being built into pipelines, and see the results at each step. Each flipbook demonstrates some important code concepts, and often new functions in R. You can click on the window to activate it, then use the left and right arrow keys to go forward and back in the code, one step at a time. You will want to go through these slowly, and make sure that you understand what is happening in each step. You may even want to take notes, particularly on the function syntax, as you will likely see coding exercises with these functions shortly after the flipbook demonstration.

Take a look at the example of a flipbook below.
Activate it by clicking on it, and step through the pipeline of code with the right and left arrow keys. Watch the results of each step.

1.9.2 Learnr Coding Exercises

Learnr coding exercises are windows in this book in which you can write your own R code to solve a problem. Each learnr exercise tests whether you have mastered important code concepts, and often new functions in R. If needed, you can reset to a fresh code window with the Start Over button. You can type lines of code into the window, then click on the Run Code button at the top right to run the code and get your results. Your code may not produce the right result the first time, and you will have to interpret the error message to figure out how to fix it. Rely on the text and your notes and the demonstrations to help you. If you are stuck, you can click on the Hint button to see an example of correct code, and compare it to your own. If you would like, you can even copy this code to the clipboard with the Copy button and paste it into the code window.

Take a look at the example of a learnr exercise below.
There is a dataset piped into a series of functions (‘verbs’), with a blank. Fill in the blank with ‘p_vol’ (without the quotes), which stands for the variable prostate volume. Then run your code with the Run Code button to get a result. Practice using the Start Over button, the Hint button (there may be more than one - usually the last one is the solution), and the Copy To Clipboard button.
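
To give a feel for what such an exercise looks like, here is a minimal sketch, not the actual exercise. The data frame toy_prostate and its values are made up for illustration; only the variable name p_vol comes from the text above. The sketch assumes the base R pipe (|>, available in R 4.1 and later) and base R functions, so the functions in the real exercise may well differ.

```r
# Hypothetical toy data: three patients with made-up prostate volumes
toy_prostate <- data.frame(
  patient_id = c(1, 2, 3),
  p_vol = c(22.5, 41.0, 35.5)
)

# The dataset is piped into a series of functions, one step at a time.
# In an exercise, the variable name p_vol would be the blank to fill in.
large_volumes <- toy_prostate |>
  subset(p_vol > 30) |>      # keep the rows with a volume above 30
  subset(select = p_vol)     # keep only the p_vol column

print(large_volumes)
```

Reading the pipeline from top to bottom, each line takes the result of the line before it and does one more thing to it.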

When you get a table of data as a result from a code pipeline, it may have more columns (variables) than can be displayed easily. When this is the case, there will be a black arrow pointing rightward at the top right of the table of results. Click on this to scroll right and see more columns.
A table of data as a result from a code pipeline may also have more rows (observations) than can be displayed easily. When this is the case, the table will be paginated, with 10 rows per page. At the bottom right of the table, there will be a clickable listing of pages, along with Previous and Next buttons. Click on these buttons (or the page number buttons) to see more pages of data to inspect your results.

An important note on coding: you should always have an internet search window open when you are writing code. No one can remember every function, nor the correct arguments and syntax of each function. A critical skill in writing code is searching for how to do something correctly. This is not a sign of weakness. Professional programmers google “how do I do x?” hundreds of times a day. This is how programming is done. You will often search for things like “how do I do x in R?” or “how to x in tidyverse”. This is completely normal, and to be expected. You do not have time to memorize hundreds of functions, and you may have days or even weeks between coding sessions (because of your day job), making it hard to remember all the details from your last coding session. This is not a problem. There are lots of websites that can help you solve specific problems, as you will find in the How to Find Help chapter.

2 Getting Started and Installing Your Tools

One of the most intimidating parts of getting started with something new is the actual getting started part. Don’t worry, I will walk you through this step by step.

2.1 Goals for this Chapter

  • Install R on your Computer
  • Install RStudio on your Computer
  • Install Git on your Computer
  • Get Acquainted with the RStudio IDE

2.3 Pathway for this Chapter

This Chapter is part of the TOOLS pathway. Chapters in this pathway include

  • Getting Started and Installing Your Tools
  • Using the RStudio IDE
  • Updating R, RStudio, and Your Packages
  • Advanced Use of the RStudio IDE
  • When You Don’t Want to Update Packages (Using renv)
  • Major R Updates (Where Are My Packages?)

2.4 Installing R on your Computer

R is a statistical programming language, designed for non-programmers (statisticians). It is optimized to work with data in rectangular tables of rows (observations) and columns (variables). It is a very fast and powerful programming engine, but it is not terribly comfortable or convenient. R itself is not terribly user-friendly. It is a lot like a drag racing car, which is basically a person with a steering wheel strapped to an airplane engine.

drag racer

Very aerodynamic and fast, but not comfortable for the long run (more than about 8 seconds). You will need something more like a production car, with a nice interior and a dashboard, and comfy leather seats.

dashboard

This equivalent of a comfy coding environment is provided by the RStudio IDE (Integrated Development Environment). I want you to install both R and RStudio, in that order.

Let’s start with installing R.
R is free and available for download on the web. Go to the r-project website to get started.

This screen will look like this:

irproject

You can use the blue link (download R) to download R, but the download will be faster if you pick a local CRAN mirror.
You might be wondering what CRAN and CRAN Mirrors are. Nothing to do with cranberries, fortunately. CRAN is the Comprehensive R Archive Network. Each site (mirror) in the network contains an archive of all R versions and packages, and the sites are scattered over the globe. A CRAN Mirror maintains an up to date copy of all of the R versions and packages on CRAN. If you use the nearest CRAN mirror, you will generally get faster downloads.
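
Once R is installed, the mirror matters again whenever you download packages. As a rough sketch, the lines below show one way to look at and set the mirror from the R prompt. The address https://cloud.r-project.org is CRAN's cloud mirror, which redirects to a nearby server; it is used here as an example, and any mirror from the list would work the same way. RStudio usually sets a mirror for you, so treat this as optional.

```r
# Show which repository (mirror) R currently uses for downloading packages
getOption("repos")

# Point R at the CRAN cloud mirror for the rest of this session
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Confirm the change
getOption("repos")[["CRAN"]]
```

This setting lasts only until you close R, so it is safe to experiment with.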


At this point, you might be wondering what a package is…
A package is a set of functions and/or data that you can download to upgrade and add features to R. It is like a downloadable upgrade to a Tesla vehicle that lets you play the video game Witcher 3 on your console, but more useful.

Another useful analogy for packages is that they are like apps for a smartphone. When you buy your first smartphone, it comes with only the basic apps that allow it to work as a phone, plus a notepad and a calculator.

If you want to do cool things with your smartphone, you download apps that allow your smartphone to have new capabilities. That is what packages do for your installation of R.
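
After R is installed, you can see this for yourself. The short sketch below lists the packages that are already on your computer and loads one of them. It uses the stats package, which ships with R, purely as an illustration.

```r
# List the names of the packages already installed on this computer
installed <- rownames(installed.packages())
head(installed)

# The stats package comes with every R installation
"stats" %in% installed

# Loading a package makes its functions available in the current session
library(stats)
median(c(3, 1, 2))
```

The same library() call is how you will load any package that you download later.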

tesla

Now let’s get started. Click on the blue link that says “download R”.
This will take you to a page to select your local CRAN Mirror, from which you will download R.

cran

Scroll down to your local country (yes, the USA is at the bottom), and a CRAN mirror near you. This is an example from the state of Michigan, in the USA.

usa-mirrors

Once you click on a CRAN Mirror site to select the location, you will be taken to the actual Download site.

install

Select the link for the operating system you want to use. We will walk through this with Windows first, then Mac. If you are using a Mac, skip forward to the Mac install directions. If you are computer-savvy enough to be using Linux, you can clearly figure it out on your own (it will look a lot like these).

2.5 Windows-Specific Steps for Installing R

If you are installing R on a Mac, jump ahead to the Mac-specific version below.

On Windows, once you have clicked through, your next screen will look like this:

install2

You want to download both base and Rtools (you might need Rtools later). The base link will take you to the latest version, which will look something like this.

install3

Click on this link, and you will be able to save a file named R-N.N.N-win.exe (Ns depending on version number) to your Downloads folder. Click on the Save button to save it.

install4

Now, go to your Downloads folder in Windows, and double click on the R installation file (R-N.N.N-win.exe). Click Yes to allow this to install.

install5exe

Now select your language option.

install_language

You will be asked to accept the GNU license - do so. Click Yes to allow this to install. Then select where to install - generally use the default, a local (often C:) drive - do not install on a shared network drive or in the cloud.

install_drive

Then select the Components - generally use the defaults, but newer computers can skip the 32 bit version.

install_comp

In the next dialog box, accept the default startup options.

install_defaults

You can choose the start menu folder. The default R folder is fine.

install_start

If you want a shortcut icon for R on your desktop, you can leave this checked. But most people start RStudio, with R running within RStudio, rather than directly starting R. You probably won’t need an R shortcut, so you can leave these boxes unchecked in the next dialog box.

install_addltasks

Then the Setup Wizard will appear - click Finish, and the rest of the installation will occur.

install_wizard

2.5.1 Testing R on Windows

Now you want to test whether your Windows installation was successful. Can you find R and make it work? Hunt for your C folder, then for OS-APPS within that folder. Keep drilling down to the Program Files folder. Then the R folder, and the current version folder within that one (R-N.N.N). Within that folder will be the bin folder, and within that will be your R-N.N.N.exe file. Double click on this to run it. The example paths below can help guide you.

install_path2

install_path

Opening the exe file will produce a classic 2000-era terminal window, called Rterm, with 64 bit if that is what your computer uses. The version number should match what you downloaded. The messaging should end with a “>” prompt.

install_term

At this prompt, type in:

paste("Two to the seventh power is", 2^7)

(don’t leave out the comma or the quotes) - then press the Enter key.

This should produce the following:

[1] "Two to the seventh power is 128"

install_test

Note that you have explained what is being done in the text, and computed the result and displayed it.
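
If you are curious about what that line did, here is a small sketch that breaks it into pieces. It uses only base R. The name sentence is just a label chosen for this illustration.

```r
# The arithmetic on its own: 2 raised to the 7th power
2^7

# paste() joins its arguments into one piece of text, separated by spaces
sentence <- paste("Two to the seventh power is", 2^7)
sentence

# The sep argument controls what goes between the pieces
paste("R", "Studio", sep = "")
```

The comma separates the two things being pasted together, which is why leaving it out produces an error.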

2.6 Mac-specific Installation of R

The installation for Mac is very similar, but the windows look a bit different. If you are working with Windows, jump ahead at this point to Installing RStudio. At the Download Version page, you click on the Mac Download. You will then click on the link for R-N.N.N.pkg, and allow downloads from CRAN.

install_path

Then go to Finder, and navigate to the Downloads folder. Click on R-N.N.N.pkg.

install_downloadmac

Click on Continue on 2 consecutive screens to download

cont1_mac

cont2_mac

Then you need to agree with the License Agreement,

mac_license

then Click on Install, and provide your Mac password for permission to install.

cont1_mac

When the installation is complete, click on the Close button. Accept the prompt to move the installer file to the trash.

2.6.1 Testing R on the Mac

Go to Finder, and then your Applications folder. Scroll down to the R file. Double click on this to run it.

findrmac

You should get this 2000-era terminal window named R Console. The version number should match what you downloaded, and the messaging should end with a “>” prompt. At this prompt, type in

paste("Two to the seventh power is", 2^7)

(DON’T leave out the comma or the quotes)

rconsolemac

This should result in

mactestR

2.6.2 Successful testing!

Awesome. You are now Ready to R!

ready2R

2.7 Installing RStudio on your Computer

Now that R is working, we will install RStudio. This is an IDE (Integrated Development Environment), with lots of bells and whistles to help you do reproducible medical research.

teslax_dash

This is a lot like adding a dashboard with polished walnut panels, a large video screen map, and heated car seats with Corinthian Leather. Not absolutely necessary, but nice to have.

The RStudio IDE wraps around the R engine to make your experience more comfortable and efficient.

camry_dash

Fortunately, RStudio is a lot cheaper than any of these cars. In fact, it is free and open source. You can download it from the web at:

rstudio

Click on the RStudio Desktop icon to begin.

download

This will take you to a new site, where you will select the Open Source Edition of RStudio Desktop

open_source

This will take you to a new site, where you will select the Free Version of RStudio Desktop

free

Now select the right version for your operating system - Windows or Mac.


mac_win

2.7.1 Windows Install of RStudio

If you are installing on a Mac, jump ahead now to the Mac-specific installation instructions.

Now save the RStudio.N.N.N.exe file (Ns will be digits representing the version number) to your downloads folder.

winsave

Now go to your downloads folder, and double click on the RStudio.N.N.N.exe file.

winlaunch

Allow this app to make changes. Click Next to Continue, and Agree to the Install Location.

wininstall

Click Install to put RStudio in the default Start Menu Folder, and when done, click the Finish button.

winsave

winfinish

Now select your preferred language option, accept the GNU license, Click Yes to allow this to install. Select where to install. This is generally on a local (often C:) drive, and usually not a shared network drive or in the cloud.

2.7.2 Testing Windows RStudio

Now you should be ready to test your Windows installation of RStudio.

Open your Start menu Program list, and find RStudio.

Pin it as a favorite now.

Click to Open RStudio.

Within the Console window of RStudio, an instance of R is started up. Check that the version number matches the version of R that you downloaded.
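
The version is printed in the startup message, but you can also ask R directly. A small sketch using two base R tools (the comparison against "3.0.0" is an arbitrary example value, not a requirement of this book):

```r
# A one-line description of the version of R that is running
R.version.string

# The version number alone, which can be compared with other versions
getRversion()
getRversion() >= "3.0.0"
```

This is handy later on, when you want to confirm which version of R you are running after an update.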

Now run a test at the prompt (“>”) in the Console window. Type in

paste("Three to the 5th power is", 3^5)

do not leave out the quotes or the comma

Then press the enter key

and this should be your result:

test_result35

A successful result means that you are ready to roll in RStudio and R!

ready

2.7.3 Installing RStudio on the Mac

Start at this link: RStudio Download

Select the Free RStudio Desktop Version

mac_download

Then click on the big button to Download RStudio for Mac.

mac_download2

After the Download is complete, go to Finder and the Downloads Folder. Double click on the RStudio.N.N.N.dmg file in your Downloads folder.

mac_dmg

This will open a window that looks like this

mac_apps

Use your mouse to drag the RStudio icon into the Applications folder.

Now go back to Finder, then into the Applications folder. Double click on the RStudio icon, and click OK to Open.

Pin your RStudio to the Dock.

Double Click to run RStudio.

RStudio will open an instance of R inside the Console pane of RStudio with the version number of R that you installed, and a “>” prompt.

2.7.4 Testing the Mac Installation of RStudio

Type in

paste("Three to the 5th power is", 3^5)

do not leave out the quotes or the comma

Then press the enter key

and this should be your result.

test_result35

A successful result means that you are ready to roll in RStudio and R!

ready

2.8 Critical Setup - Tuning Your RStudio Installation

You now have 6+ adjustments that you need to make in your RStudio Global Settings for optimal R and RStudio use.

  1. At this point, it is a good idea to jump out of RStudio and create an “Rcode” folder on your computer, in a place that is easy to find, often at the top level in your Documents folder, to make all of your future projects easy to find.

Once this Rcode folder is in place, switch back to RStudio. In the RStudio Menus, go to Tools/Global Options. A new Global Options window will open up. Click on the General tab on the left. At the top, there is a small window for identifying your Default working directory. Click on the Browse button, and browse to your new “Rcode” folder and select it. From now on, your R files and Projects will all be in one place and easy to find.
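
You can check where R is working from the Console. This is a minimal sketch. The path "~/Documents/Rcode" is an assumed location for the folder, so replace it with wherever you created yours. The new default usually takes effect the next time you start RStudio.

```r
# Print the folder R is currently working in
getwd()

# Check whether a folder exists; "~/Documents/Rcode" is an assumed location,
# so replace it with the path to your own Rcode folder
dir.exists("~/Documents/Rcode")
```

If getwd() shows your Rcode folder in a fresh session, the setting has worked.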

  2. In the same General tab, de-select the first 3 options
  • turn off Restore most recently opened project at startup
  • turn off Restore previously open source documents at startup
  • turn off Restore .RData into workspace
  3. In the same General tab, find Save workspace to .RData on exit. Click on the dropdown menu to select “Never”

These tune-ups (#2 and #3) to your RStudio will mean you will always start with a clean workspace in a new RStudio session, which will avoid a lot of potential problems later.

  4. In the same General tab, at the top, click on the Advanced tab. Then select the box for
  • Show full path to project in window title. This will show your working directory at the top of your Console Pane. This can prevent confusion and problems later.
  5. On the left, click on the Rmarkdown tab. Then de-select the option for
  • Show output inline for all Rmarkdown documents.

This will put your temporary output from Code Chunks into the larger and nicer Viewer tab.

  6. Take a look at the Appearance tab. You can change your code font, the font size, and the theme. I wouldn’t make any drastic changes at this point, but it is good to know that these options are available. Any changes here are entirely optional (and cosmetic) at this point.

  7. In the RStudio menus, select Code, then check/select two options to turn these on:

  • Soft Wrap Long Lines - so that your code does not get too wide
  • Rainbow Parentheses - color-codes parentheses so that you can keep track of whether you have closed all of your open parentheses (a common source of errors)

Now your RStudio installation is tuned and ready to go!

2.9 Installing Git on your Computer

The software program, git, is a version control system. It is the most common version control system in the world. It is free and open source, and is the foundation of reproducible computing.

We won’t be doing a lot with git just yet, but it is helpful to get this installation done and out of the way. It will come up a lot when we start to discuss reproducible research and collaboration.

2.9.1 Installing Git on macOS

If you are using Windows, jump ahead to Installing Git on Windows.

  1. The easiest approach on the macOS is to go to the Terminal tab in the Console pane (lower left) in RStudio. A prompt will appear that ends in a $.

At that prompt, type git --version

note that there are 2 dashes before version.

This will tell you the current version of git (2.29.2 as of January 1, 2021), or prompt you to install git.

  2. If you want the current version of git, you can install this yourself.

a. First, let’s check if you have homebrew installed.
Go to the Terminal tab in the Console pane (lower left) in RStudio. A prompt will appear that ends in a $.

at the prompt, type command -v brew

This should return “/usr/local/bin/brew” (or “/opt/homebrew/bin/brew” on a Mac with Apple Silicon) if homebrew is installed. If it is not installed, the command will return nothing, or a message like “brew not found”.

b. Installing homebrew

At the terminal prompt($), paste in the following:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Then press Enter to run it. This installs the homebrew program, which allows you to install software on macOS that does not come from the Apple App Store.

This will take a couple of minutes.

c. Installing git

Once you have homebrew installed, installing git is straightforward.

At the Terminal prompt ($), type

brew install git

and this will quickly install. You will be prompted to click Continue buttons to complete the installation.

  3. Check your installation.

    At the Terminal prompt ($), type

    git --version

    and this should return a result like “git version 2.29.2”, depending on the version number.

2.9.2 Installing Git on Windows

If you are using Windows, go to the website, https://git-scm.com/download/win.

  1. This will start the download automatically

  2. Go to your downloads folder and install the downloaded .exe file by clicking on it

  3. Check your installation.

    At the Terminal prompt ($), type

    git --version

    and this should return a result like “git version 2.29.2”, depending on the version number.

2.9.3 Installing Git on Linux

If you are using Fedora, or a related version of Linux like RHEL or CentOS, use dnf

At the $ prompt, type sudo dnf install git-all

If you are using a Debian-based version of Linux like Ubuntu, use apt

At the $ prompt, type sudo apt install git-all

For other distributions of Linux, follow the instructions at https://git-scm.com/download/linux.

  1. Check your installation.

    At the Terminal prompt ($), type

    git --version

    and this should return a result like “git version 2.29.2”, depending on the version number.
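
Whichever operating system you use, you can also check for git from the R Console instead of the Terminal. This is an alternative added here as a sketch, not one of the installation steps above. It relies on the base R functions Sys.which() and system2(), and the name git_path is chosen only for this illustration.

```r
# Look for git on the search path; an empty string means it was not found
git_path <- unname(Sys.which("git"))
git_path

# If git was found, ask it for its version, as in the Terminal
if (nzchar(git_path)) {
  system2(git_path, "--version")
} else {
  message("git was not found on the search path")
}
```

If the result is an empty string right after installing git, restarting RStudio is a reasonable first thing to try.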

2.10 Getting Acquainted with the RStudio IDE

When you first open the RStudio IDE (Integrated Development Environment), there will be a left side pane, with tabs for Console, Terminal, Rmarkdown, and Jobs.

Just for fun, go to the RStudio menus, and choose File/New File/R Script. This will open a new pane at the top left, which we will call Q1 (quadrant 1), or the top left pane, or the Source pane. This pane will contain tabs for each active script or document, along with tabs for any datasets you have opened up to have a look at.

The Quadrant 2 pane, with tabs for Console, Terminal, Rmarkdown, and Jobs, has now been pushed to the lower left pane. You will use the Console for interactive programming, and as a “sandbox” to test out new code. When your code works and is good enough to save, you will move it to the Source pane and save it to a Script or an Rmarkdown document. Any code that is not saved to Source will be lost (actually it will be somewhere in the History, but it can be a pain to find the version that works later - it is best to save the good stuff to a Script or .Rmd).

The Quadrant 3, or top right pane, includes tabs for your Environment (objects, like datasets, functions, and variables you have defined), History (saving the past in case you forget to, but messy), and Connections tabs for connections to databases. Later a Git tab will be added for version control (backup) of your Source documents.

In Quadrant 4, or the bottom right pane, you will find tabs for your Files, Plots, Packages, Help, and a Viewer for HTML output.
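
To see the panes working together, try a few lines in the Console. This is a small sketch with made-up numbers. The object name heart_rates is hypothetical and chosen only for illustration.

```r
# Typed in the Console: create an object holding three made-up heart rates
heart_rates <- c(72, 88, 64)

# Use the object in a calculation
mean(heart_rates)

# List the objects you have defined, which the Environment tab also shows
ls()
```

After the first line runs, heart_rates should appear in the Environment tab, and all three lines will be recorded in the History tab.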

This is material that is also well described in the “Basic Basics 1” section of RLadiesSydney. Check it out at BasicBasics1. There is a nice ~ 15 minute video by Jen Richmond worth watching if you are just getting started. Note that a lot of the other material on this website (RYouWithMe) is very helpful for people new to R.

3 A Tasting Menu of R

In this chapter, we will introduce you to a lot of neat things that you can do with R and RStudio, and you will publish a simple data analysis on the Internet that you can share with friends and family.

tasting

3.1 Setting the Table

At the end of this chapter, you will publish a data analysis to RPubs, a free website where you can share your data analyses and visualizations. First you will need to set up an account on RPubs. Start by opening a new tab in your browser, and navigating to this RPubs link. It should look like the image below.

Enter your name, email, username and password, and click on the Register Now button, and you will be set up to use RPubs.
This will bring you to this page. In the image below, we have set up an account for pdr.
Click on the Here’s How You Get Started link.
You are now all set up and ready to go. Now you have a place on the internet to share your R creations. On to the creation part!

3.2 Goals for this Chapter

  • Open a New Rmarkdown document
  • Read in Data from a file
  • Wrangle Your Data
  • Visualize Your Data
  • Publish your work to RPubs
  • Check out Interactive Plots
  • Check out Animated Graphics
  • Check out a Clinical Trial Dashboard
  • Check out a Shiny App

3.3 Packages needed for this Chapter

You will need to enter this line of code into your console, to make sure that the tidyverse package is installed on your computer:

install.packages("tidyverse")

In the setup chunk of your Rmarkdown document, you will need to access the tidyverse package with one line of code:

library(tidyverse)

3.5 Pathway for this Chapter

This Chapter is part of the XXX pathway. Chapters in this pathway include

3.6 Open a New Rmarkdown document

Let’s get started in R. Turn on your computer, and open the RStudio application. You should see the familiar panes for the Console, Environment, and Files.
You need to open up a new document to activate the Source pane. While in RStudio, click on File/New File/RMarkdown. It should look like this.
Now you will see the window below. Rename the document from “Untitled” to “Tasting”, enter your own name as the Author, and click the OK button.

Now the file is open, and looks like the window below. Click on the save icon (like a floppy disk in the top left), and save this document as tasting.Rmd.

You have created a new Rmarkdown document. An Rmarkdown document lets you mix data, code, and descriptive text. It is very helpful for presenting and explaining data and visualizations. An Rmarkdown document can be converted (Knit) to HTML for a web page, Microsoft Word, Powerpoint, PDF, and several other formats.

Code chunks are in a gray color, and both start and end with 3 backticks (```).

code goes here

Text can be body text, or can be headers and titles. The number of hashtags before some header text defines what level the header is.
You can insert links, images, and even YouTube videos into Rmarkdown documents if it is helpful to explain your point.

The first code chunk in each Rmarkdown document is named setup. The name comes after the left curly brace and the r ({r) at the beginning of the setup chunk. The letter r tells RStudio that what is coming on the next line is R code (RStudio can also use SQL, C++, python, and several other languages). After the comma, you can define options for this code chunk. In this case, the option include is set to FALSE, so that when this Rmarkdown document is knitted, this code chunk will not appear.

3.7 Read in Data from a file

We will start by reading in some data from the {medicaldata} package.

3.7.1 Installing Packages

Before we begin, you have to have a few R packages installed on your computer.

Go to your Console tab, and type in (or copy and paste in) the following 3 lines:

install.packages('tidyverse')
install.packages('janitor')
install.packages('medicaldata')

Press Enter to run these functions. These will install the 3 packages, {tidyverse}, {janitor}, and {medicaldata}. Installing packages is like buying apps for your phone. But these apps are not loaded unless you tell R and RStudio that you want them loaded in the current session. You do this with the library() function.

3.7.2 Loading Packages with library()

Copy and paste to add the following 4 lines (below the {r} line) to your setup chunk in your “Tasting.Rmd” Rmarkdown document:

library(tidyverse)
library(janitor)
library(medicaldata)
prostate <- medicaldata::blood_storage %>% clean_names()

These functions will load 3 packages and read in data from a study of prostate cancer and blood storage into the prostate object.

To run these functions, click on the green rightward arrow at the top right of the setup code chunk.

The {tidyverse} package (it is actually a meta-package that contains multiple packages) will be quite chatty, telling you which packages are being attached, and when conflicts with identically-named functions in the {stats} package have occurred. When you call these functions, filter() and lag(), the versions from the {tidyverse} (specifically, from the {dplyr} package) will be used by default, and the versions from the {stats} package will be masked.

The {janitor} package will tell you that it has 2 conflicts with the {stats} package, and will supersede (mask) the {stats} functions for chisq.test() and fisher.test().

If you really want to access the versions from the {stats} package, you can do so by using the package::function construction, e.g. stats::chisq.test().
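As a small illustration of the package::function construction, here is a minimal sketch that calls the {stats} version of chisq.test() explicitly. The counts in the table are made up for this example and do not come from any dataset in this book.

```r
# A tiny 2 x 2 table of made-up counts (illustrative values only)
counts <- matrix(
  c(12, 5, 7, 15),
  nrow = 2,
  dimnames = list(
    treatment = c("drug", "placebo"),
    outcome   = c("improved", "not improved")
  )
)

# The package::function form always calls the {stats} version,
# whether or not {janitor} has masked chisq.test() in this session
result <- stats::chisq.test(counts)
result$p.value
```

The same construction works for any function in any installed package, which is why you will see medicaldata::blood_storage used in the setup chunk.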

If you check the Environment tab in the top right pane of RStudio, you will find that you now have a prostate object under the Data header. You can click on the white-on-blue arrow to the left of the word prostate to get an overview of each variable, the variable type (numeric, string, etc.), and the first few values of each variable.

You can also click on the word prostate in the Environment window to open up a View of the whole dataset in the Source pane (top left). You can scroll up and down the rows, or right and left in the columns to inspect the data.

If you check the Console tab (lower left), you will see that when you clicked on prostate, this sent a function to the console to View(prostate). You can view any dataset in the Environment tab with this function.

You can also look at your data in the Console, with

summary(prostate) or

glimpse(prostate)

Underneath the setup chunk, write something about the prostate dataset. You can write in Normal text, and add headers by starting a line with 2 hashtags, a space, and text like this

## Headline about Prostate data

Write a few sentences after your headline. You can add italics or bold text by wrapping the text to be highlighted in underscores or 2 asterisks, respectively.

3.8 Wrangle Your Data

Add a new code chunk

name it

wrangle it - select, filter

3.9 Visualize Your Data

first ggplot

3.10 Publish your work to RPubs

push to RPubs

You did it.

Share website with others

3.11 The Dessert Cart

Below are some examples of neat things you can do with medical data in R. These are more advanced approaches, but completely doable when you have more experience with R.

3.11.1 Interactive Plots

3.11.2 Animated Graphics

3.11.3 A Clinical Trial Dashboard

3.11.4 A Shiny App

3.11.5 An Example of Synergy in the R Community

One of the remarkable things about the open source R community is that people build all kinds of new R functions and packages that are useful to them, and then share them publicly with tools like Github so that they can be useful to others. Often combining bits of several packages leads to emergent properties - completely new creations that can only occur because all of the parts (packages) are present. The collaborative nature of the R community, in this case on Twitter (follow the #rstats hashtag), can lead to surprising collaborations and outcomes.
Go ahead and play the example below, which uses rayrendering (all coded entirely in R) to show a 3D map of John Snow’s cholera case data in 1854, which led him to identify the Broad Street water pump as the source of the cholera outbreak, and led to the removal of the pump handle and the end of the outbreak.

If you are not familiar with John Snow and the Broad Street pump, there is a fun series of YouTube animations (parts 1-3 and an epilogue) to explain the history. Start by clicking here.

4 Wrangling Rows in R with Filter

In this chapter, we will introduce you to ways to wrangle rows in R. You will often want to focus your analysis on particular observations, or rows, in your dataset. This chapter will show you how to include the rows you want, and exclude the rows you don’t want. Once your data wrangling and data validation are done, you will be ready for data analysis.

4.1 Filtering on Numbers - Starting with A Flipbook

If you have not used a flipbook before, you can click on the frame below to activate it, then use right and left arrow keys to move forward and back through the demo.

With each forward step in the code on the left, examine the resulting output on the right. Make sure you understand how the output was produced.

4.1.1 Your Turn - learnr exercises

4.2 Filtering on Multiple Criteria with Boolean Logic

You can use multiple filters on your data, and combine these with AND (&), OR (|), XOR (the xor() function), parentheses, and combinations thereof.
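Here is a minimal sketch of each of these combinations. The toy tibble is hypothetical, with made-up values; its column names (age, aa, fam_hx) are chosen to resemble the prostate dataset.

```r
library(dplyr)

# Toy data with made-up values (hypothetical, for illustration only)
toy <- tibble(
  age    = c(58, 67, 71, 49, 66),
  aa     = c(1, 1, 0, 0, 1),
  fam_hx = c(0, 1, 1, 0, 0)
)

# AND: both conditions must be TRUE
older_aa <- toy %>% filter(age > 65 & aa == 1)

# OR: at least one condition must be TRUE
aa_or_fam <- toy %>% filter(aa == 1 | fam_hx == 1)

# XOR: exactly one of the two conditions is TRUE
one_only <- toy %>% filter(xor(aa == 1, fam_hx == 1))

# Parentheses control which conditions are grouped together
combo <- toy %>% filter(age > 65 & (aa == 1 | fam_hx == 1))
```

Without the parentheses in the last example, R would evaluate the & before the |, which selects a different set of rows.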

4.2.1 Your Turn - learnr exercises

4.3 Filtering Strings

You can use == to test exact equality of strings, but you can also use str_detect from the {stringr} package, and combine it with the magic of regex to do complicated filtering on character string variables in datasets.
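Here is a minimal sketch comparing exact matching with pattern matching. The diagnosis values are made up for this example.

```r
library(dplyr)
library(stringr)

# Toy data with made-up diagnoses (hypothetical, for illustration only)
toy <- tibble(
  subject_id = 1:5,
  diagnosis  = c("Crohn's disease", "ulcerative colitis", "crohn's colitis",
                 "IBS", "Ulcerative proctitis")
)

# Exact equality: the whole string must match, including capitalization
exact <- toy %>% filter(diagnosis == "IBS")

# str_detect() finds a pattern anywhere in the string;
# regex(ignore_case = TRUE) makes the match case-insensitive
crohns <- toy %>%
  filter(str_detect(diagnosis, regex("crohn", ignore_case = TRUE)))

# A regular expression: "^" anchors the match to the start of the string,
# and "[Uu]" matches either an upper or lower case u
uc <- toy %>% filter(str_detect(diagnosis, "^[Uu]lcerative"))
```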

4.3.1 Your Turn - learnr exercises

4.4 Filtering Dates

You can use the {lubridate} package to convert strings into dates for logical tests, and filter your observations by date, month, year, etc.
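Here is a minimal sketch of date filtering, assuming visit dates that arrive as character strings in year-month-day order. The visits tibble and its values are hypothetical.

```r
library(dplyr)
library(lubridate)

# Toy data: dates stored as character strings (made-up values)
visits <- tibble(
  subject_id = 1:4,
  visit_date = c("2020-11-15", "2021-01-03", "2021-06-30", "2022-02-14")
)

# Convert the strings to real dates with ymd() (year-month-day)
visits <- visits %>% mutate(visit_date = ymd(visit_date))

# Filter on a component of the date
in_2021 <- visits %>% filter(year(visit_date) == 2021)

# Filter on a cutoff date
after_mid_2021 <- visits %>% filter(visit_date >= ymd("2021-06-01"))
```

If your dates are in a different order, {lubridate} has matching functions such as mdy() and dmy().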

4.4.1 Your Turn - learnr exercises

4.5 Filtering Out or Identifying Missing Data

You can use the is.na(), drop_na() and negation with ! to help identify and filter out (or in) the missing data, or observations that are incomplete.
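Here is a minimal sketch of these three tools on a toy tibble with made-up blood pressure values. Note that drop_na() comes from the {tidyr} package, which is part of the {tidyverse}.

```r
library(dplyr)
library(tidyr)

# Toy data with some missing values (made-up, for illustration only)
toy <- tibble(
  subject_id = 1:5,
  sbp = c(120, NA, 135, 142, NA),
  dbp = c(80, 85, NA, 90, NA)
)

# Identify the rows where sbp is missing
missing_sbp <- toy %>% filter(is.na(sbp))

# Negate with ! to keep the rows where sbp is present
has_sbp <- toy %>% filter(!is.na(sbp))

# Drop rows with a missing value in ANY column
complete <- toy %>% drop_na()

# Drop rows with a missing value in sbp only
sbp_only <- toy %>% drop_na(sbp)
```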

4.5.1 Your Turn - learnr exercises

4.6 Filtering Out Duplicate observations

You can use the {janitor} package to help you find duplicated observations/rows for fixing or removal from your dataset.
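Here is a minimal sketch using get_dupes() from {janitor} to find duplicates, and distinct() from {dplyr} as one possible way to remove exact duplicate rows. The toy tibble is hypothetical, and whether removal is the right fix depends on why the duplicate exists.

```r
library(dplyr)
library(janitor)

# Toy data in which subject 2 was entered twice (made-up values)
toy <- tibble(
  subject_id = c(1, 2, 2, 3),
  visit      = c(1, 1, 1, 1),
  sbp        = c(120, 130, 130, 110)
)

# get_dupes() returns every row that shares subject_id and visit with
# another row, and adds a dupe_count column
dupes <- toy %>% get_dupes(subject_id, visit)

# distinct() keeps one copy of each fully identical row
deduped <- toy %>% distinct()
```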

4.6.1 Your Turn - learnr exercises

4.7 Slicing Data by Row

You can use the slice() family of functions to cut out a chunk of your observations/rows by position in the dataset.
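Here is a minimal sketch of a few members of the slice() family, using a hypothetical toy tibble with made-up ages.

```r
library(dplyr)

# Toy data with made-up ages (hypothetical, for illustration only)
toy <- tibble(
  subject_id = 1:10,
  age = c(34, 51, 47, 62, 29, 70, 55, 41, 66, 38)
)

# Rows 1 through 3, by position
first_three <- toy %>% slice(1:3)

# The first 2 and the last 2 rows
top_two  <- toy %>% slice_head(n = 2)
last_two <- toy %>% slice_tail(n = 2)

# The 3 rows with the largest values of age
oldest_three <- toy %>% slice_max(age, n = 3)
```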

4.7.1 Your Turn - learnr exercises

4.8 Randomly Sampling Your Rows

You can use the slice_sample() function to take a random subset of large datasets, or to build randomly selected training and testing sets for modeling.
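Here is a minimal sketch of both uses. The toy data are simulated, and the 80/20 split and the use of anti_join() to build the testing set are choices made for this example, not the only way to do it.

```r
library(dplyr)

# Setting a seed makes the random sample reproducible
set.seed(2024)

# Simulated toy data (hypothetical, for illustration only)
toy <- tibble(
  subject_id = 1:100,
  age = sample(30:80, 100, replace = TRUE)
)

# A random subset of 10 rows
small <- toy %>% slice_sample(n = 10)

# A random 80% of rows for training
training <- toy %>% slice_sample(prop = 0.8)

# The remaining rows (those not in training) for testing
testing <- toy %>% anti_join(training, by = "subject_id")
```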

4.8.1 Your Turn - learnr exercises

5 Interpreting Error Messages

Especially when you are starting out, it can be very difficult to interpret error messages, because these can be quite jargon-y.

Let’s start with a table of the most common error messages, and the likely cause in each case.

Note that when reading an error message, there are two parts - the part before the colon, which identifies in which function the error occurred, and the part after the colon, which names the error. A typical error message is usually in the format:

Error in Where the error occurred : what the error was

Here is an example:

Error in as_flextable(.) : object 'errors' not found

On the left, you are being told that the error occurred when the as_flextable() function was called. This can be helpful if you have run a long pipeline of functions, as it helps you isolate the problem.

On the right, you are being told what the error was. In this case, the function looked for the object errors in the working environment (see your Environment tab at the top right in RStudio), and could not find it.
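If you want to see these two parts for yourself, here is a minimal sketch that captures an error with base R's tryCatch() instead of letting it stop your code. The deliberately wrong call, log("ten"), is just an example chosen for this sketch.

```r
# Capture the error object instead of stopping
err <- tryCatch(
  log("ten"),               # deliberately wrong: log() needs a number
  error = function(e) e
)

# The "where" part: the function call in which the error occurred
conditionCall(err)

# The "what" part: the text that appears after the colon
conditionMessage(err)
```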

Note that sometimes syntax errors caused by missing components (a missing comma, a missing parenthesis, a missing pipe symbol %>% , or a missing + sign in a ggplot pipe) will cause an error in the next function in the pipeline. Watch out for this, especially when the function where the error is found looks fine - often it occurs because there is a missing piece just before this function.

Then we will walk through examples of how to create each error, and how to fix them, one by one.

5.1 The Common Errors Table

Examine the error message from R, particularly the part that comes after the colon (:). The error messages listed in the left column will be what appears after the colon (:)

Common Error Messages in R
Error Message What it Means
could not find function

This usually means that you made a typographical error in the function name (including capitalization - R is case-sensitive), or that the package you are intending to use (which contains the function) is either not installed (fix this with `install.packages('package_name')`) or not loaded (fix this with `library(package_name)`).

object ‘object-name’ not found

This usually means that the function looked for an object (like a data frame or a vector) in your working environment (check your Environment pane) and could not find it. This commonly happens when you

  1. mistype the name of the object (double-check this, easy to fix), or

  2. you did not actually create or save this object to your working environment - confirm by checking your Environment tab at the top right in RStudio.

filename does not exist in current working directory (‘path/to/working/directory’)

This usually means one of three things:

  1. you mistyped the name of the file, or part of the path,

  2. you are not in the directory where the file is, or

  3. the file you thought you had saved does not exist (check your Files tab in the lower right pane in RStudio).

error in if

This usually means that you have an *if* statement that is trying to make a branch-point decision, but the logical statement that you wrote is not providing either a TRUE or a FALSE value. The most common reasons are typographical errors in the logical statement, or an NA in one of the underlying values, which yields an NA from the logical statement. You may need to handle the missing values first, for example with is.na(), or with an `na.rm = TRUE` option in a summary function (like mean()) that feeds your logical statement.

error in eval

This usually occurs when you are trying to run a function on an object that does not exist in your environment. Check to make sure in your Environment pane, and consider that you may not have saved/assigned the object. Alternatively, you may have a typographical error in the object name. Worth checking.

cannot open

This usually occurs when you are attempting to access or read a file that either does not exist, or is not in the folder that you thought it was. Check your working directory and find the file in your file structure. This can often be prevented by working in RStudio projects and using the here() function for paths to files.

no applicable method

This usually occurs when you are using a function that expects a particular data structure (vector, list, dataframe), but you have given it a different data structure as the input. Check the data structure of your object, and check the documentation for your function. For example, if you want to use a function that acts on vectors, this function will not work on a dataframe variable. You may have to use the `pull(var)` function to “pull” this variable out of the dataframe into a vector before using this function.

subscript out of bounds

You are trying to access an item in an object (like a vector, dataframe, or list) that does not exist, like the 9th item in a list that is 7 items long, or the 5th row of a matrix that has only 3 rows. Check the length of the item, and the math that you used to count the item number (loops that go too long are often a culprit).

replacement has [x] rows, data has [y] rows

This usually occurs when you are trying to code for a new variable, or replace a variable in a dataframe. But somehow (missing values, NAs), what you are trying to add to the dataframe is not the same length (number of rows) as the rest of the existing dataframe. Use a length() function to check your building of this vector at each step, to figure out where your length went wrong.

package not available for R version x.y.z

This occurs when you are trying to install a package, and your R version is newly updated. The problem is that the package version available on CRAN has not caught up to your shiny new version of R. This can happen after an R update when the package developer is working on updating their package, but the new version has not made it onto CRAN yet. This is often fixable if you know where the developer stores their development code (usually on GitHub). For example, if the package is {medicaldata}, and the developer’s Github userid is higgi13425, then you can install the development version of this package with remotes::install_github('higgi13425/medicaldata'). This assumes that you have already installed the {remotes} package (because of the remotes:: prefix, it does not need to be loaded).

non-numeric argument to binary operator

A binary operator, like + or *, is a mathematical operation that takes two values (operands) and produces another value. It gets grumpy when trying to do math on things that are not numbers. A typical input to produce this error would be 1 + 'one' - one operand is numeric, and the other, 'one', is a character string - the non-numeric argument.

object of type 'closure' is not subsettable

This occurs when you try to extract a subset of something - but it is actually a function, not a data object. This most commonly occurs when you try to subset a particular object that does not exist, like df$patient_id or data$sbp, when you have not created the objects df or data. The reason you get this strange error message, rather than simply Error: object 'df' not found, is that df() and data() are already defined as functions in R. It is good practice to avoid naming any objects data or df for this reason. It gets very confusing, and this is best avoided.
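To see a few of these messages safely, here is a minimal sketch in base R. The helper show_error() is a hypothetical function written for this example (it is not part of R or any package); it runs an expression and returns the text of the error message instead of stopping. The sketch assumes a fresh session in which you have not created objects named data or no_such_object.

```r
# Hypothetical helper: return the error message (the part after the colon)
show_error <- function(expr) {
  tryCatch(
    {
      expr          # the expression is evaluated here
      "no error"
    },
    error = function(e) conditionMessage(e)
  )
}

# Math on a character string
show_error(1 + "one")          # "non-numeric argument to binary operator"

# An if statement that receives NA instead of TRUE or FALSE
show_error(if (NA) "yes")      # "missing value where TRUE/FALSE needed"

# Asking for the 9th item of a 3-item list
show_error(list(1, 2, 3)[[9]]) # "subscript out of bounds"

# Subsetting data when no data object exists, so R finds the data() function
show_error(data$sbp)

# Using an object that was never created
show_error(no_such_object)
```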

5.2 Examples of Common Errors and How to fix them

5.2.1 Missing Parenthesis

This is a very common error. It is easy to lose track of how many sets of parentheses you have open in putting together a complicated function.

Here is an example, where a closing parenthesis is missing from a mutate() function.

prostate %>% 
  select(t_vol, p_vol, age, aa) %>% 
  mutate(ratio = t_vol/p_vol,
         older_aa = case_when(age >65 & aa == 1 ~ 1,
                              TRUE ~0) %>% 
  filter(older_aa ==1)

In this case, no output is produced, and the console does not return to the > prompt. Instead, it offers a + prompt - in effect, asking you for something more. If you type in an extra closing parenthesis (after the filter function), it will give you an error.

The error you get is:

Error: Problem with `mutate()` input `older_aa`.
x no applicable method for ‘filter_’ applied to an object of class “c(‘double’, ‘numeric’)”
ℹ Input `older_aa` is ``%>%`(…)`.

R identifies a problem with the input “older_aa” to mutate - the parentheses are not closed.
It then fails on the next function - filter, and gives you a strange error message - filter_ applied to… - because the input to the filter step (the next step after the error) was incoherent. This can be a bit confusing. But if you inspect the input older_aa, you will find the mis-matched parentheses. This is much easier to find with “rainbow parentheses” turned on in Tools/Global Options. When this option is on, you can be sure your parentheses are right when you end on red.

In this case, adding the missing parenthesis to the mutate step fixes it.

Parentheses that end on red are all right.
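For reference, here is a sketch of the pipeline with the parenthesis restored. So that it runs on its own, it uses a small hypothetical stand-in named toy_prostate, with made-up values, in place of the real prostate object.

```r
library(dplyr)

# Hypothetical stand-in for the prostate data (made-up values)
toy_prostate <- tibble(
  t_vol = c(1, 2, 3, 2),
  p_vol = c(54, 43.2, 102.7, 46),
  age   = c(72.1, 60.2, 68.5, 66.3),
  aa    = c(0, 1, 1, 1)
)

fixed <- toy_prostate %>%
  select(t_vol, p_vol, age, aa) %>%
  mutate(ratio = t_vol / p_vol,
         older_aa = case_when(age > 65 & aa == 1 ~ 1,
                              TRUE ~ 0)) %>%   # two closing parentheses here
  filter(older_aa == 1)

fixed
```

The first closing parenthesis ends case_when(), and the second ends mutate().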

5.2.2 An Extra Parenthesis

What if you go the other way, with an extra parenthesis after some misguided copy-paste adventures? Let’s see what happens.

prostate %>% 
  select(t_vol, p_vol, age, aa) %>% 
  mutate(ratio = t_vol/p_vol,
         older_aa = case_when(age >65 & aa == 1 ~ 1,
                              TRUE ~0))) %>% 
  filter(older_aa ==1)

In this code block, you will end up with two red closing parentheses, and when you click to the right of the final closing parenthesis, there will be no matching highlighted open parenthesis (note that the preceding closing parentheses both have matching highlighted open parentheses). Both of these are clues that this last one is an extra.

The error you get from R is

Error in filter(older_aa == 1) : object ‘older_aa’ not found

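The left side of the error message identifies the filter step as where the error occurs, and the right side of the error message states that the error is an object not found. The error occurs when R gets to the next function. R will usually also complain about an unexpected ')' just before this error, because the extra parenthesis ends the pipeline early; the filter line is then run on its own, without the piped data. The message also tells you that older_aa was not available to filter - suggesting that the problem is in the step before the filter function.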

In this case, removing the extra parenthesis from the mutate step fixes it.

5.2.3 Missing pipe %>% in a data wrangling pipeline

This is a common error. It is easy to cut out one of your %>% connectors when you are editing/debugging a data wrangling pipeline.

Here is an example, where a %>% is missing. Can you spot it?

prostate %>% 
  select(t_vol, p_vol, age, aa)  
  mutate(ratio = t_vol/p_vol,
         older_aa = case_when(age >65 & aa == 1 ~ 1,
                              TRUE ~0)) %>% 
  filter(older_aa ==1)

In this case, the error you get is:

Error in mutate(ratio = t_vol/p_vol, older_aa = case_when(age > 65 & aa == : object ‘t_vol’ not found

The left side of the error message identifies the mutate step as where the error occurs, and the right side of the error message states that the error is an object not found. This is a bit misleading, as the problem is not in the mutate step. But mutate is where the pipeline crashes, as it cannot find the variable t_vol. You have to backtrack upwards line-by-line to find the error. Every line of a data wrangling pipeline (except the last) should end in %>%. Since this is such a common error, this should be one of your “usual suspects”. And the select line, just above the mutate line, is where the problem is.

In this case, adding the missing %>% to the end of the select step fixes your data wrangling pipeline.

Use one function per line in a pipeline.
Check every data wrangling pipeline to make sure each step (except the last) ends in a pipe %>%

5.2.4 Missing + in a ggplot pipeline

This is a common error. It is easy to cut out one of your + connectors when you are editing/debugging a ggplot.

Here is an example, where a + is missing in the middle of a ggplot pipeline.

prostate %>% 
  select(t_vol, p_vol, age, aa) %>% 
  ggplot(aes(x = factor(t_vol), y =p_vol)) 
  geom_boxplot() +
  labs(x = "tumor volume", y = "prostate volume") +
  theme_minimal()

In this case, you get a ggplot output, but without any boxplots. It is also missing your custom labels for the x and y axes, and the theme you wanted. Essentially, R treats the initial ggplot() statement as a complete plot and draws it, and then runs the remaining lines of code as a separate statement, which fails. This can be pretty puzzling, as you do get a plot, but not what you intended. There is a partial plot in the Plots tab, and you get a somewhat helpful error in the Console.

The error you get is:

Error: Cannot add ggproto objects together. Did you forget to add this object to a ggplot object?

R identifies a problem with the last 3 lines of code, starting with geom_boxplot() - it cannot add these ggproto objects (the components of a ggplot) to the existing plot. It asks, “Did you forget to add?” which should be a clue that there is a missing + sign between lines of ggplot code. The fact that the theme and labels are the defaults, and that there are no boxplots, suggests that these last 3 lines were not added to the plot at all, and that the missing plus sign should be found just before these lines of code.

In this case, adding the missing + to the end of the ggplot step fixes your plot.

Use one function per line in a pipeline.
Check every ggplot pipeline to make sure each step (except the last) ends in a plus sign +
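For reference, here is a sketch of the plot code with the + restored. So that it runs on its own, it uses a small hypothetical stand-in named toy_prostate, with made-up values, in place of the real prostate object.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the prostate data (made-up values)
toy_prostate <- tibble(
  t_vol = c(1, 2, 3, 2, 1, 3),
  p_vol = c(54, 43.2, 102.7, 46, 60.1, 78.4),
  age   = c(72.1, 60.2, 68.5, 66.3, 59.9, 70.4),
  aa    = c(0, 1, 1, 1, 0, 0)
)

p <- toy_prostate %>%
  select(t_vol, p_vol, age, aa) %>%              # pipes while wrangling
  ggplot(aes(x = factor(t_vol), y = p_vol)) +    # plus signs from here on
  geom_boxplot() +
  labs(x = "tumor volume", y = "prostate volume") +
  theme_minimal()

p
```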

5.2.5 Pipe %>% in Place of a +

This is a common error. It is easy to start with your dataset, do some data wrangling steps with the pipe %>% and keep piping out of habit, even after you start your ggplot. Unfortunately, once you start to ggplot, you have to use + as your code connector. Having a pipe instead will cause an error.

Here is an example, where a %>% is used instead of + in a ggplot pipeline. It usually happens at the beginning of the ggplot, when you are still in piping mode.

prostate %>% 
  select(t_vol, p_vol, age, aa) %>% 
  ggplot(aes(x = factor(t_vol), y =p_vol)) %>% 
  geom_boxplot() +
  labs(x = "tumor volume", y = "prostate volume") +
  theme_minimal()

In this case, you will not get a ggplot output, and you will get an error in the console.

The error you get is:

Error: `mapping` must be created by `aes()` Did you use %>% instead of +?

The error message identifies the aes() step as where the error occurs. R identifies a problem that causes the aes function to fail to create a mapping. The first line is not very helpful (other than identifying aes() as a problem), but in the next line, R asks, “Did you use %>% instead of +?” which is very helpful. Once you know this, look at the line where aes() failed. This is where there is a pipe in place of a plus.

In this case, replacing the %>% with a + fixes your plot.

5.2.6 Missing Comma Within a Function()

This is a common error. It is easy to start a series of arguments to a function, like multiple variables in a mutate step, and miss a comma between them.

Here is an example, where a comma is missing in a series of mutate steps. Note that it is a good habit to put one mutate step on each line, with each line ending in a comma. This will help you find the missing comma if (no, when) you make this mistake.

prostate %>% 
  select(t_vol, p_vol, age, aa) %>% 
  mutate(ratio = t_vol/p_vol,
         older_aa = case_when(age >65 & aa == 1 ~ 1,
                              TRUE ~0)
         age_decade = floor(age / 10)) %>% 
  filter(older_aa ==1)

In this case, you will not get a tibble output, and you will get an error in the console.

The error you get is:

Error in filter(older_aa == 1) : object ‘older_aa’ not found

The left side of the error message identifies the filter step as where the error occurs, and the right side of the error message states that the error is an object not found. R identifies a problem that causes the filter function to fail, but this is actually a problem in the line prior. The variable older_aa was not created and is not available to filter. It should have been created in the mutate step, but this step is where the failure occurred. Because you formatted the mutate step with one mutate statement per line, it is easy to check each line for a comma - and the older_aa line is missing its comma.

In this case, adding a comma at the end of the older_aa line (after “TRUE ~0)”) fixes your data wrangling pipeline.

5.2.7 A Missing Object

This is a common error. You may have created or modified a dataframe, but forgot to assign it to a new object name. Or maybe you did this assignment in a different session, but have not done it in your current session. Or maybe you made a typographical error in calling the object (“covvid” instead of “covid”). Either way, this object is not yet loaded into your computing environment (the Environment tab).

In this example, we request data from the {medicaldata} package, but forget to assign it to an object.

So it does not exist when we try to use it to start a pipeline. This does not work.

medicaldata::covid_testing

covid %>% 
  select(subject_id, age, result, ct_result, patient_class) %>% 
  mutate(high_titer = case_when(ct_result < 18 ~ 1,
                                    TRUE ~ 0),
         age_decade = floor(age / 10)) %>% 
  filter(age >50)

In this case, you will not get a tibble output, and you will get an error in the console.

The error you get is:

Error in select(., subject_id, age, result, ct_result, patient_class) : object ‘covid’ not found

The portion to the left of the colon identifies where the error occurs - in the select step. The portion to the right of the colon identifies the error. This one is easy. The object ‘covid’ was not found. You can check your Environment pane, and it will not be there. What the coder intended was to call medicaldata::covid_testing and assign it (with an arrow) to a new object named covid. But that assignment did not happen, and R is unable to guess what you meant.

In this case, adding an assignment arrow to the medicaldata::covid_testing line (either covid <- medicaldata::covid_testing, or medicaldata::covid_testing -> covid) completes the assignment, creates the covid object, and fixes your data wrangling pipeline.
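Here is a sketch of the repaired version. So that it runs without the {medicaldata} package installed, it assigns from a hypothetical stand-in named covid_source, with made-up values; with the real package, the assignment line would use medicaldata::covid_testing instead.

```r
library(dplyr)

# Hypothetical stand-in for medicaldata::covid_testing (made-up values)
covid_source <- tibble(
  subject_id    = 1:4,
  age           = c(45, 62, 71, 8),
  result        = c("negative", "positive", "positive", "negative"),
  ct_result     = c(45, 16.2, 30.1, 45),
  patient_class = c("outpatient", "inpatient", "emergency", "outpatient")
)

# The fix: assign the data to an object named covid BEFORE using it
covid <- covid_source

over_50 <- covid %>%
  select(subject_id, age, result, ct_result, patient_class) %>%
  mutate(high_titer = case_when(ct_result < 18 ~ 1,
                                TRUE ~ 0),
         age_decade = floor(age / 10)) %>%
  filter(age > 50)

over_50
```

After the assignment runs, covid appears in your Environment tab, which is a quick way to confirm that the object exists.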

5.2.8 One Equals Sign When you Need Two

This is a very common error. The equals sign is commonly used in two ways in R.

  1. To assign a parameter or argument of a function, like x = p_vol, or ratio = p_vol/t_vol, or color = "blue". In all of these assignment cases, you use one equals sign.
  2. To test a logical statement, like age == 60, or fam_hx == 1, or location == "Outpatient". In all of these logical tests, you use two equals signs.

It is very common to use one equals sign in a logical statement. This causes errors. Watch the last filter step below.

prostate %>% 
  select(t_vol, p_vol, age, aa) %>% 
  mutate(ratio = t_vol/p_vol,
         older_aa = case_when(age >65 & aa == 1 ~ 1,
                              TRUE ~0),
         age_decade = floor(age / 10)) %>% 
  filter(older_aa =1)

In this case, the error you receive is very helpful:

Error: Problem with `filter()` input `..1`.
x Input `..1` is named.
ℹ This usually means that you’ve used `=` instead of `==`.
ℹ Did you mean `older_aa == 1`?

The problem is with the filter step. The error starts out very jargon-y. “input `..1`. x Input `..1` is named” - means the input to filter is actually named (an assignment). But then it gets a lot more helpful. It recognizes that you have made a common error, and suggests an appropriate fix.

In this case, adding a 2nd equals sign in the filter step fixes your data wrangling pipeline.

Testing for equality with == is a big problem with real numbers, rather than integers. Computers use algorithms to do math which are not quite exact, leading to small differences in decimals. The == equality test is very strict, so that something like sqrt(2)^2 == 2 is FALSE because of small differences far to the right of the decimal point, which can trip you up. You can see these if you run the modulo 2: sqrt(2)^2 %% 2, which gives you the remainder after you divide by 2, which is the very tiny 0.0000000000000004440892. In this situation, you should use the near() function, as near(sqrt(2)^2, 2) is TRUE. The near function has a built-in tolerance of 0.00000001490116, which will be able to handle any computer-generated small, stray decimals. You can set your own tolerance argument if needed.
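Here is a minimal sketch of the difference between == and near(). The toy tibble and its dose values are made up for this example.

```r
library(dplyr)

# Strict equality fails because of tiny floating point differences
sqrt(2)^2 == 2          # FALSE
0.1 + 0.2 == 0.3        # FALSE

# near() allows for a small tolerance
near(sqrt(2)^2, 2)      # TRUE
near(0.1 + 0.2, 0.3)    # TRUE

# The same problem shows up when filtering (made-up values)
toy <- tibble(dose = c(0.1 + 0.2, 0.3, 0.5))

strict  <- toy %>% filter(dose == 0.3)       # finds 1 row
tolerant <- toy %>% filter(near(dose, 0.3))  # finds 2 rows

# You can set your own tolerance with the tol argument
near(1.001, 1, tol = 0.01)  # TRUE
```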

5.2.9 Non-numeric argument to a binary operator

This happens when you try to do math on things that are not numbers. It usually occurs when you have a variable (column) that looks like it is numeric (it contains numbers), but somewhere along the way it became a character string variable. This often occurs when data are being entered into a spreadsheet, and one value in the column has characters in it. This often happens when you have a column of systolic blood pressures, and one value is entered as “this was not done”, or “102, but taken standing up”. Even a single text entry like this in a column in Excel will cause the whole column to be read into R as the character string data type.

This is not apparent until you try to do math with this variable, as in

data %>% 
  mutate(mean_art_pressure = sbp/3 + 2/3* dbp)

This will give you the error:

Error in mutate(mean_art_pressure: non-numeric argument to binary operator

To fix this, you will have to

  1. Determine which variable, sbp or dbp, is non-numeric (glimpse(data) will help).

  2. Review the values of the problem variable (possibly with table()) to find which is non-numeric.

  3. Fix these values manually in your code, and document with comments

    1. Which values are being fixed (e.g. sbp for subject 007, at visit 2)
      data$sbp[data$subject == 007 & data$visit == 2] <- 102

    2. What the original value was, and what the new value will be

    3. Who made the change to the data

    4. Why the data change was made

    5. On what date the data change was made

    6. Never over-write your original data - keep a complete audit trail!
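The steps above can be sketched as follows. The data frame here is invented for illustration (the subject numbers, visits, and values are assumptions), and this is one possible approach rather than the only way to do it.

```r
library(dplyr)

# Hypothetical data: sbp was imported as character because of one comment
data <- tibble(
  subject = c(1, 2, 7),
  visit   = c(2, 2, 2),
  sbp     = c("118", "131", "102, but taken standing up"),
  dbp     = c(76, 84, 68)
)

# Steps 1 and 2: list the rows where sbp cannot be converted to a number
problem_rows <- data %>%
  filter(!is.na(sbp), is.na(suppressWarnings(as.numeric(sbp))))

# Step 3: fix the value in a new object, leaving the original data untouched
# Fix: sbp for subject 7, visit 2; was "102, but taken standing up", now 102
# Changed by: (your name); why: comment typed into a numeric field; date: (date)
data_clean <- data %>%
  mutate(sbp = if_else(subject == 7 & visit == 2, "102", sbp)) %>%
  mutate(sbp = as.numeric(sbp)) %>%
  mutate(mean_art_pressure = sbp / 3 + 2 / 3 * dbp)
```

Note that fixing the one bad value is not enough on its own. The column stays a character column until you convert it with as.numeric().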

5.3 Errors Beyond This List

This is where the internet comes in handy. Whatever errors you can create, someone has already run into. And they have asked for help on the internet, and most of the time, someone has helped them solve their error.

You should copy your entire error message, and paste it into a web search. Google will often yield multiple similar examples, with various ways to solve the problem.

Remember that the error may have occurred because of a problem in the previous line of code (missing parenthesis, comma, etc.), so don’t forget to check one line above.

The Add-One-Line debugging strategy is a good place to start. Select the code for your pipeline from the beginning to 2 lines of code before the error. If that runs without errors, add one line to your selection, and run it. Keep adding lines to your selection and running until you hit the error. Then try to find the problem and fix it.

5.4 When Things Get Weird

5.4.1 Restart your R Session (Shift-Cmd-F10 on a Mac, Ctrl-Shift-F10 on Windows)

If you are running code that has worked before, and it is not working now, it is possible that you have created something odd in your working Environment that is interfering with your code. Sometimes it is an old object from a previous session (it is always better to start from a clean slate). Completely restart your R session (click on Session/Restart R, or use the keyboard shortcut), make sure the Environment is clean, then run your code from start to finish to give it a new try. Sometimes a clean slate will make all the difference.

6 Updating R, RStudio, and Your Packages

6.1 Installing Packages

The most important way to extend R is to add packages. Each package adds new functions and/or data to R, enabling you to do much more in the R and RStudio environment.

When you open R, or start a new session, you have only the base version of R available, and it is pretty spartan. You can see how many packages you have available to you by starting RStudio and going to the menu Session/New Session, or Session/Restart R. Each of these will give you a clean workspace to start in. Once you have started a new session, or restarted R, run the following code:

print(.packages())
##  [1] "medicaldata" "forcats"     "stringr"     "dplyr"      
##  [5] "purrr"       "readr"       "tidyr"       "tibble"     
##  [9] "ggplot2"     "tidyverse"   "stats"       "graphics"   
## [13] "grDevices"   "utils"       "datasets"    "methods"    
## [17] "base"

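The output above comes from a session in which {tidyverse} and {medicaldata} had already been loaded, so it lists 17 packages. In a fresh session, you will typically find only the 7 packages that R attaches by default: base, utils, methods, stats, graphics, grDevices, and datasets. (You will see others, such as devtools and usethis, only if your startup settings load them.)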

In order to use more of the power of R and RStudio, you will need to install packages (a one-time task), and load them (in each session) before use with a library(package_name) function.

If you Google a bit for ways to do things in R, you will find many packages that can be helpful. The most strictly validated packages are hosted on CRAN (the Comprehensive R Archive Network), a network of mirrored servers. There are now over 20,000 packages on CRAN to do various specialized things in R. These were all useful to someone, who then shared them on CRAN. To install packages from CRAN, you use the function:

install.packages("package_name")

Notice that the package_name has to be in quotes. These can be single or double quotes. The package_name and install.packages() are case-sensitive, like all objects and functions in R, so something like Install.Packages will not work.

Once the package is installed, it stays in the R library associated with your current major version of R. You will need to reinstall (and thereby update) your packages each time you upgrade to a new major version of R. R versions are designated as R version #.#.#. A change in the third number indicates a minor version change. A change in the first or 2nd number (from R 3.6.2 to 4.0.0, or 4.0.2 to 4.1.0) is a major version upgrade which will require re-installation of packages.

Let’s practice installing a package. Run the code below to install the tidyverse package.

install.packages("tidyverse")
## 
## The downloaded binary packages are in
##  /var/folders/93/s18zkv2d4f556fxbjvb8yglc0000gp/T//RtmpWt7a0M/downloaded_packages

6.1.1 Installing Packages from Github

Some packages are still in development. These are often in repositories on GitHub, rather than on the CRAN servers. To install these packages, you need to know the path to the repository. You can install the medicaldata package from GitHub. Run the code below to install this package.

devtools::install_github("higgi13425/medicaldata")
## Using github PAT from envvar GITHUB_PAT
## Skipping install of 'medicaldata' from a github remote, the SHA1 (1c039d8b) has not changed since last install.
##   Use `force = TRUE` to force installation

In contrast to install.packages(), the library() function can work with quotes around the package_name, but they are not required. This is because these packages are already installed in your R library, and are known quantities. In general, known objects in your R environment do not require quotes, and novel things like packages do require quotes.

If you re-run print(.packages()) at this point, you will not see any more packages. This is because you have installed new packages, but not loaded them.

6.1.2 Problems with Installing Packages

6.1.2.1 R Version Issues

Sometimes you may run into a problem installing a package which was developed for a previous version of R. Especially if you have recently upgraded your R version, the CRAN version of a package may be a bit behind. This can often be fixed by googling for “github” and “package_name”. This will usually lead you to the GitHub repository for that package, which will have a pathname of “github_username/package_name”. Once you know this, you can use

devtools::install_github("github_username/package_name") to install the newest version of the package, which will usually be compatible with the latest version of R.

6.1.2.2 Installing from Source vs Binaries

6.1.2.3 Dependencies

Some packages are dependent on specific versions of other packages, and will ask you to update the other packages during installation. As a general rule, you should say ‘yes’. If you are worried about over-writing an existing package in a way that would break your code in a different project, then that project needs its own project-specific library, which you can create with the {renv} package.

6.1.2.4 Extra-R Dependencies

Sometimes packages require (depend upon) software that is not part of the R ecosystem. These will generally give you messages during the install process asking you to install this helper software. Common helper software includes things like a Fortran compiler and Java (which the {rJava} package depends on). Sometimes you will need to go to websites, or use software like Homebrew (on the Mac) to install these extra helper pieces of software.

6.2 Loading Packages with Library

Run the code chunk below to load both {tidyverse} and {medicaldata}. Note that the {tidyverse} package is actually a meta-package that contains 8 packages, and each one has its own version number.

library(tidyverse)
library(medicaldata)

Notice that loading tidyverse led to some conflict messages. The dplyr::filter function masks the stats::filter() function. These two packages, {dplyr} and {stats}, both have a function named filter(). The more recently loaded package is assumed to be the default, so if you call a filter() command, R will use dplyr::filter(). If you want to call the stats::filter() command, you have to explicitly use the package::function() format. If you are not sure which package you loaded last, it can be wise to use the explicit format when calling functions in R.

The other masked function is lag(). The function dplyr::lag() is masking stats::lag(), as {dplyr} was loaded after {stats}. Most of the time this is not a big difference, but every once in a while a conflict between package functions can get very confusing. When in doubt, use the explicit format, in which you call package::function() to make clear what you mean - dplyr::lag() vs. stats::lag().
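To make the difference concrete, here is a short sketch that calls both versions of filter() explicitly. It uses the built-in mtcars data set and a made-up series of the numbers 1 to 5, neither of which appears elsewhere in this chapter.

```r
library(dplyr)

# dplyr::filter() keeps the rows of a data frame that meet a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)

# stats::filter() applies a linear filter to a series of numbers
# (here a 3-point moving average), which is a completely different job
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
```

The two functions share a name but nothing else, which is why the explicit package::function() format can save you from confusing errors.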

Note that it is good practice to load all of your packages needed for an R script or an Rmarkdown (.Rmd) document at the beginning of the script or .Rmd. This allows someone else using your script or Rmd to check whether they have the needed packages installed, and install them if needed. In an Rmarkdown document, this is done in a special setup code chunk near the top of the document. If some of these packages are not on CRAN, it is good practice to add a comment (a statement after a hashtag) on how to install this package. For example, in a setup chunk that loads {tidyverse} and {medicaldata}, it is a good idea to add a comment on how to install {medicaldata}, which is not yet on CRAN. See the example below

library(tidyverse)
library(medicaldata)
# the {medicaldata} package can be installed with devtools::install_github('higgi13425/medicaldata')
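If you want to go one step further, a setup chunk can check which packages are missing before trying to load them. This is a minimal sketch using base R's requireNamespace(); the object names (needed, is_installed, not_installed) are just illustrative.

```r
needed <- c("tidyverse", "medicaldata")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not_installed <- needed[!is_installed]

if (length(not_installed) > 0) {
  message("Please install: ", paste(not_installed, collapse = ", "))
}
```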

6.3 Updating R

6.4 Updating RStudio

6.5 Updating Your Packages

7 Major R Updates (Where Are My Packages?)


8 Checking, Validating, And Asserting things about your Data

So you have imported your data! Great! Now to start the analysis!

Not so fast, cowboy!
First you need to validate your data.

8.1 Why Spend Time Validating Your Data?

It is much more exciting to make plots, to make interactive Shiny apps of your models to share on the web, and to knit your Markdown documents to Word or PDF.

But it turns out that most of the truly heinous, embarrassing errors in medical data analysis occur during the process of data wrangling.

Imagine being the star of some of these sordid tales.

After the authors of a paper published in JAMA in 2019 shared their SAS code on GitHub, an interested critic noticed that they listed data for 73,000 kidney transplants in the US in one year. But someone familiar with the UNOS data knew that there are about 280,000 kidney transplants per year. During a merge step between two databases, SAS had silently over-written much of the original data. This discovery led to a retraction of the paper and a re-analysis, demonstrating a much smaller effect size. (Gander JC, Zhang X, Ross K, et al. Association between dialysis facility ownership and access to kidney transplantation. Retracted and replaced April 21, 2020. JAMA. 2019;322(10):957-973.) Twitter thread: https://twitter.com/eric_weinhandl/status/1253127109830156289?s=20

After publishing a report of a randomized controlled trial in COPD in JAMA in 2018 (Aboumatar H, Naqibuddin M, Chung S, et al. Effect of a Program Combining Transitional Care and Long-term Self-management Support on Outcomes of Hospitalized Patients With Chronic Obstructive Pulmonary Disease. A Randomized Clinical Trial. Retracted and replaced Nov 12, 2018. JAMA. 2018;320(22):2335-2343.), the authors realized that they had miscoded the treatment arms. For their logistic regression analysis, they had to recode the treatment arms from 1 and 2 to 0 and 1. Unfortunately, they flipped the values, and interpreted their results as beneficial. When they realized that the mis-coding changed the result from beneficial to harmful, they reported it to the journal and retracted the paper.

After publishing a report on Best Practices for In-Hospital Cardiac Arrest in JAMA Cardiology, the authors found coding errors in their data. Nine of the 130 hospitals had been misclassified, which changed some of their associations. (https://jamanetwork.com/journals/jama/article-abstract/2764714)

A med student analyzing a dataset for the first time uses boolean statements to categorize values. But she does not realize that this Stata dataset used “99” for missing values.
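Errors like the ones in these stories can often be caught with a few quick checks right after each wrangling step. The sketch below uses small invented tables (every name and value is hypothetical, including the assumption that arm 2 is the intervention arm) to show three such checks with {dplyr}: counting rows after a merge, cross-tabulating a recoded variable, and converting a missing-value code to NA.

```r
library(dplyr)

# Hypothetical toy tables; the names and values are invented for illustration
patients <- tibble(patient_id = 1:5, facility_id = c(10, 10, 20, 30, 30))
facilities <- tibble(facility_id = c(10, 20),
                     ownership = c("for-profit", "nonprofit"))

# Check 1: a left join should neither drop nor duplicate patients
merged <- left_join(patients, facilities, by = "facility_id")
stopifnot(nrow(merged) == nrow(patients))
unmatched <- sum(is.na(merged$ownership))  # patients whose facility had no match

# Check 2: after recoding, cross-tabulate the old and new codes
# (assumes, for illustration, that arm 2 is the intervention arm)
arms <- tibble(arm = c(1, 2, 2, 1, 2))
recoded <- arms %>% mutate(treated = if_else(arm == 2, 1, 0))
recode_check <- count(recoded, arm, treated)

# Check 3: convert a missing-value code (99 here) to NA before summarizing
vitals <- tibble(pain_score = c(3, 7, 99, 5))
cleaned <- vitals %>% mutate(pain_score = na_if(pain_score, 99))
mean_pain <- mean(cleaned$pain_score, na.rm = TRUE)
```

None of these checks takes more than a line or two, and each one targets a specific way that data can go quietly wrong.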

8.1.0.1 Cleaning – names with janitor package to snake_case

8.1.0.1.1 A few words about tidyverse style

8.1.0.2 Finding Missing data – naniar and visdat packages

8.1.0.3 Validating data – validate package

8.1.1 Asserting properties of your data with assertr

8.1.1.1 Evaluating – str, glimpse

8.1.1.2 Exploring- skimr package

8.1.1.3 Histograms

8.1.1.4 Correlations – ggally extension of ggplot2, and corrr package

9 Time Series data with the Tidyverts Packages

Fun text here. All kinds of crazy examples. Time series with data from influenza pandemic of 1918-19, perhaps.

9.1 Tsibble

Time series tibble

Tidyverts webpage

9.2 Fable

Tidy forecasting

9.3 Feasts

Feature extraction and Statistics

9.4 Slider

Rolling analysis with window functions.
Slider pkgdown page

10 Descriptive Data Tables

In this chapter, we will focus on making the descriptive table of the participants in your study, often colloquially known as “Table One”, based on its usual placement in a medical manuscript.

Before we plunge in, I would like to make one point of warning. It is quite common in a multiple-arm randomized controlled trial to compare the distribution of particular baseline characteristics of the subjects between arms with a p value, usually in a column at the far right. This is silly, as this produces a whole column of p values, corresponding to the multiple comparisons performed. With 20 comparisons, by chance, you are likely to get one or more “significant” p values. These are not helpful or meaningful, and are considered bad statistical practice.
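The arithmetic behind that claim can be sketched in a few lines. This assumes the 20 comparisons are independent and each is tested at the usual 0.05 threshold, which is a simplification.

```r
alpha <- 0.05     # significance threshold for each comparison
n_tests <- 20     # number of baseline characteristics compared

# Probability of at least one "significant" p value by chance alone
p_at_least_one <- 1 - (1 - alpha)^n_tests   # about 0.64

# Expected number of false positives among the 20 tests
expected_false_positives <- alpha * n_tests  # 1
```

In other words, even with perfect randomization, you would see at least one “significant” baseline difference in roughly two out of three such tables.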

Let me quote the CONSORT guidelines on the publications of clinical trials.

“Unfortunately significance tests of baseline differences are still common; they were reported in half of 50 RCTs published in leading general journals in 1997. Such significance tests assess the probability that observed baseline differences could have occurred by chance; however, we already know that any differences are caused by chance. Tests of baseline differences are not necessarily wrong, just illogical. Such hypothesis testing is superfluous and can mislead investigators and their readers. Rather, comparisons at baseline should be based on consideration of the prognostic strength of the variables measured and the size of any chance imbalances that have occurred.”
CONSORT STATEMENT


Despite this, some journals and editors still ask for these p values. Please resist, and quote the CONSORT statement. If you must do this, please do it only under duress.

10.1 Making Table One

10.1.1 The tableby function in the arsenal package

10.1.2 The gtsummary package with flextable

This is a newer approach which offers many of the same features as tableby. The gtsummary package is a companion to/built upon the gt package, (“gt” for grammar of tables), which is supported by RStudio. The gtsummary package, like gt, is designed to produce nice html output with lots of nice formatting.

However, as a nice bonus, gtsummary includes a neat function, as_flex_table() (named as_flextable() in early versions of the package), which converts your resulting table into a flextable, which can be knit to a Microsoft Word Document or a Powerpoint presentation with Rmarkdown.

This means that you can make a table once, and be able to produce output in HTML for webpages, Microsoft Word for manuscripts, and MS Powerpoint for presentations from the same file without any conversion issues.

The only question is how and when you prefer to format your table. Both gt and flextable have great options for formatting your tables. You can do this in gt, then do as_flextable, or you can convert to a flextable first, then do your formatting. You can choose based on your comfort and familiarity with flextable vs. gt. Both have excellent explanatory websites, with flextable here and gtsummary here.

10.1.3 Example of how to build a Table 1 with gtsummary

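As a starting point, here is a minimal sketch of a Table One built with gtsummary. It uses the trial example data set that ships with the gtsummary package (not a data set from this book), and the choice of variables is only for illustration. In keeping with the CONSORT advice above, it does not add a column of p values.

```r
library(gtsummary)

# Summarize three baseline variables, split by treatment arm (trt)
table_one <- tbl_summary(
  data = trial,
  by = trt,
  include = c(age, grade, marker)
)

# Optional: convert to a flextable for Word or Powerpoint output,
# only if the flextable package is installed
if (requireNamespace("flextable", quietly = TRUE)) {
  table_one_flex <- as_flex_table(table_one)
}
```

Printing table_one in RStudio or in a knitted HTML document shows the formatted table.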

10.2 Making An Adverse Events Table

10.3 Making A Results Table

11 Comparing Two Measures of Centrality


A common question in medical research is whether one group had a better outcome than another group. These outcomes can be measured with dichotomous outcomes like death or hospitalization, but continuous outcomes like systolic blood pressure, endoscopic score, or ejection fraction are more commonly available, and provide more statistical power, and usually require a smaller sample size.
There is a tendency in clinical research to focus on dichotomous outcomes, even to the point of converting continuous measures to dichotomous ones (aka “dichotomania”, see Frank Harrell comments here), for fear of detecting and acting upon a small change in a continuous outcome that is not clinically meaningful.
While this can be a concern, especially in very large, over-powered studies, it can be addressed by aiming for a continuous difference that is at least as large as one that many clinicians agree (a priori) is clinically important (the MCID, or Minimum Clinically Important Difference).
The most common comparison of two groups with a continuous outcome is to look at the means or medians, and determine whether the available evidence suggests that these are equal (the null hypothesis). This can be done for means with Student’s t-test.
Let’s start by looking at the cytomegalovirus data set. This includes data on 64 patients who received a bone marrow stem cell transplant, and looks at their time to activation of CMV (cytomegalovirus). In the code chunk below, we group the data by a baseline variable (sex, as the chunk is shown here), and look at the mean time to CMV activation (the time.to.cmv variable). Run the code (using the green arrow at the top right of the code chunk below) to see the difference in time to CMV activation in months between groups.

Try out some other grouping variables in the group_by statement, such as donor CMV status (donor.cmv), race, and recipient.cmv. Edit the code and run it again with the green arrow at the top right.

# insert libraries in each chunk as if independent
library(tidyverse)
library(medicaldata)

cytomegalovirus %>% 
  group_by(sex) %>% 
  summarize(mean_time2cmv = mean(time.to.cmv)) ->
summ

summ
## # A tibble: 2 x 2
##     sex mean_time2cmv
##   <dbl>         <dbl>
## 1     0          13.7
## 2     1          12.7

As shown, the chunk groups by sex, and the means are 13.7 months and 12.7 months. If you group by donor.cmv instead, you may see a larger difference, and it makes theoretical sense that having a CMV positive donor is more likely to be associated with early activation of CMV in the recipient. But is a difference like this significant, one that would be very unlikely to happen by chance? That depends on things like the number of people in each group, and the standard deviation in each group. That is the kind of question you can answer with a t-test, or for particularly skewed data like hospital length of stay or medical charges, a Wilcoxon test.

11.1 Common Problem

  • Comparing two groups
    • Mean or median vs. expected
    • Two arms of study - independent
    • Pre and post / spouse and partner / left vs right arm – paired groups
  • Are the means significantly different?
  • Or the medians (if not normally distributed)?

11.1.1 How Skewed is Too Skewed?

  • Formal test of normality = Shapiro-Wilk test
  • Use the cytomegalovirus data set from {medicaldata}
library(tidyverse)
library(medicaldata)
data <- cytomegalovirus
head(data)
##   ID age sex race                    diagnosis
## 1  1  61   1    0       acute myeloid leukemia
## 2  2  62   1    1         non-Hodgkin lymphoma
## 3  3  63   0    1         non-Hodgkin lymphoma
## 4  4  33   0    1             Hodgkin lymphoma
## 5  5  54   0    1 acute lymphoblastic leukemia
## 6  6  55   1    1                myelofibrosis
##   diagnosis.type time.to.transplant prior.radiation
## 1              1               5.16               0
## 2              0              79.05               1
## 3              0              35.58               0
## 4              0              33.02               1
## 5              0              11.40               0
## 6              1               2.43               0
##   prior.chemo prior.transplant recipient.cmv donor.cmv
## 1           2                0             1         0
## 2           3                0             0         0
## 3           4                0             1         1
## 4           4                0             1         0
## 5           5                0             1         1
## 6           0                0             1         1
##   donor.sex TNC.dose CD34.dose CD3.dose CD8.dose TBI.dose
## 1         0    18.31      2.29     3.21     0.95      200
## 2         1     4.26      2.04       NA       NA      200
## 3         0     8.09      6.97     2.19     0.59      200
## 4         1    21.02      6.09     4.87     2.32      200
## 5         0    14.70      2.36     6.55     2.40      400
## 6         1     4.29      6.91     2.53     0.86      200
##   C1/C2 aKIRs cmv time.to.cmv agvhd time.to.agvhd cgvhd
## 1     0     1   1        3.91     1          3.55     0
## 2     1     5   0       65.12     0         65.12     0
## 3     0     3   0        3.75     0          3.75     0
## 4     0     2   0       48.49     1         28.55     1
## 5     0     6   0        4.37     1          2.79     0
## 6     0     2   1        4.53     1          3.88     0
##   time.to.cgvhd
## 1          6.28
## 2         65.12
## 3          3.75
## 4         10.45
## 5          4.37
## 6          6.87

11.1.2 Visualize the Distribution of data variables in ggplot

  • Use geom_histogram or geom_density (pick one or the other)
  • look at the distribution of CD3.dose or time.to.cmv
  • Bonus points: facet by sex or race or donor.cmv
  • Your turn to try it
library(tidyverse)
library(medicaldata)

data %>% 
ggplot(mapping = aes(time.to.cmv)) +
  geom_density() +
  facet_wrap(~sex) +
  theme_linedraw()

library(tidyverse)
library(medicaldata)
 
data %>% 
ggplot(mapping = aes(time.to.cmv)) +
  geom_histogram() +
  facet_wrap(~race)

11.1.3 Test the Distribution of data$time.to.cmv with Shapiro-Wilk

  • The distribution of time.to.cmv looks skewed in the plots above
  • May be problematic for using means
  • formally test with Shapiro-Wilk
library(tidyverse)
library(medicaldata)

data$time.to.cmv %>% 
shapiro.test()
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.68261, p-value = 0.0000000001762

11.1.4 Results of Shapiro-Wilk

  • p-value = 0.0000000001762
  • p is < 0.05
  • The distribution of time.to.cmv is significantly different from normal
  • Better to compare medians rather than means
  • use wilcoxon test rather than t test
    • if p is not < 0.05, can use the t test
    • the wilcoxon test for two groups is also known as Mann-Whitney test
    • a rank-based (non-parametric) test
11.1.5 Try it yourself

  • use df <- msleep
library(tidyverse)
library(medicaldata)

df <- msleep 
head(df$sleep_total)
## [1] 12.1 17.0 14.4 14.9  4.0 14.4
  • test the normality of total sleep hours in mammals

11.1.6 Mammal sleep hours

library(tidyverse)
library(medicaldata)

shapiro.test(df$sleep_total)
## 
##  Shapiro-Wilk normality test
## 
## data:  df$sleep_total
## W = 0.97973, p-value = 0.2143
  • meets criteria - acceptable to consider normally distributed
  • now consider - is the mean roughly 8 hours of sleep per day?

11.2 One Sample T test

  • univariate test
    • Ho: mean is 8 hours
    • Ha: mean is not 8 hours
  • can use t test because shapiro.test is NS

11.2.1 How to do One Sample T test

library(tidyverse)
library(medicaldata)

t.test(df$sleep_total, alternative = "two.sided",
       mu = 8)
  • Try it out, see if you can interpret results

11.2.2 Interpreting the One Sample T test

## 
##  One Sample t-test
## 
## data:  df$sleep_total
## t = 4.9822, df = 82, p-value = 0.000003437
## alternative hypothesis: true mean is not equal to 8
## 95 percent confidence interval:
##   9.461972 11.405497
## sample estimates:
## mean of x 
##  10.43373
  • p is highly significant
    • can reject the null, accept alternative
    • sample mean 10.43, CI 9.46-11.41
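The numbers in the printed output are also stored in the object that t.test() returns, so you can pull them out by name. A short sketch, using the same msleep data (which comes with {ggplot2}):

```r
library(ggplot2)  # provides the msleep data set

one_sample <- t.test(msleep$sleep_total, alternative = "two.sided", mu = 8)

one_sample$estimate   # the sample mean
one_sample$conf.int   # the 95% confidence interval
one_sample$p.value    # the p value
```

This is handy when you want to report a result in your text without copying and pasting numbers by hand.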

11.2.3 What are the arguments of the t.test function?

  • x = vector of continuous numerical data
  • y= NULL - optional 2nd vector of continuous numerical data
  • alternative = c("two.sided", "less", "greater"),
  • mu = 0
  • paired = FALSE
  • var.equal = FALSE
  • conf.level = 0.95
  • documentation

11.3 Insert flipbook for ttest here

Below is a flipbook slide show that illustrates a bit of how to do a t-test. Click on it, then use the arrow keys (right/left and/or up/down) to step forward and backward through the slides, one line of code at a time, and watch what each line does as more results appear.

Give it a try below. See if you can figure out what each line of code is doing.

11.3.1 Flipbook Time!

This is t-testing in action.

11.4 Fine, but what about 2 groups?

  • consider df$vore
library(tidyverse)
library(medicaldata)
library(janitor) # provides tabyl()
prostate <- medicaldata::blood_storage
tabyl(prostate$AA)
##  prostate$AA   n   percent
##            0 261 0.8259494
##            1  55 0.1740506
  • hypothesis - herbivores need more time to get food, sleep less than carnivores
  • how to test this?
    • normal, so can use t test for 2 groups
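To look at the groups in the vore variable of msleep before testing, a quick sketch with {dplyr} could be:

```r
library(dplyr)
library(ggplot2)  # provides the msleep data set

# How many mammals are in each diet group?
vore_counts <- msleep %>% count(vore)

# Mean and standard deviation of total sleep in each group
vore_summary <- msleep %>%
  group_by(vore) %>%
  summarize(n = n(),
            mean_sleep = mean(sleep_total),
            sd_sleep = sd(sleep_total))
```

Note that some mammals have a missing (NA) value for vore, and these show up as their own row in both tables.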

11.4.1 Setting up 2 group t test

  • formula interface: outcome ~ groupvar
library(tidyverse)
library(medicaldata)

df %>% 
  filter(vore %in% c("herbi", "carni")) %>% 
  t.test(formula = sleep_total ~ vore, data = .)
  • Try it yourself
  • What do the results mean?

11.4.2 Results of the 2 group t test

## 
##  Welch Two Sample t-test
## 
## data:  sleep_total by vore
## t = 0.63232, df = 39.31, p-value = 0.5308
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.911365  3.650509
## sample estimates:
## mean in group carni mean in group herbi 
##           10.378947            9.509375

11.4.3 Interpreting the 2 group t test

  • Welch t-test (not Student)
    • Welch does NOT assume equal variances in each group
  • p value NS (not significant)
  • cannot reject the null hypothesis
    • Ho: means of groups roughly equal
    • Ha: means are different
    • 95% CI crosses 0
  • Carnivores sleep a little more, but not a lot

11.4.4 2 group t test with wide data

  • You want to compare column A with column B (data are not tidy)
  • Do mammals spend more time awake than asleep?
library(tidyverse)
library(medicaldata)

t.test(x = df$sleep_total, y = df$awake)

11.4.5 Results of 2 group t test with wide data

library(tidyverse)
library(medicaldata)

t.test(x = df$sleep_total, y = df$awake)
## 
##  Welch Two Sample t-test
## 
## data:  df$sleep_total and df$awake
## t = -4.5353, df = 164, p-value = 0.00001106
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.498066 -1.769404
## sample estimates:
## mean of x mean of y 
##  10.43373  13.56747
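One thing to keep in mind: each mammal contributes both a sleep_total and an awake value, so these two columns are not independent samples. A paired version of the test, which compares the two values within each mammal, might look like the sketch below. This is offered as an alternative to consider, not as a correction of the output above.

```r
library(ggplot2)  # provides the msleep data set

# Paired t test: sleep_total vs awake, matched within each mammal
paired_result <- t.test(msleep$sleep_total, msleep$awake, paired = TRUE)

paired_result$method     # confirms which test was run
paired_result$estimate   # mean of the within-mammal differences
```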

11.5 3 Assumptions of Student’s t test

  1. Sample is normally distributed (test with Shapiro)
  2. Variances are homogeneous (homoskedasticity) (test with Levene)
  3. Observations are independent
  • not paired like left vs. right colon
  • not paired like spouse and partner
  • not paired like measurements pre and post Rx

11.5.1 Testing Assumptions of Student’s t test

  • Normality - test with Shapiro
    • If not normal, Wilcoxon > t test
  • Equal Variances - test with Levene
    • If not equal, Welch t > Student’s t
  • Observations are independent
    • Think about data collection
    • are some observations correlated with some others?
    • If correlated, use paired t test
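The checks above can be sketched in code as follows, using herbivores and carnivores from msleep. The Levene test is not part of base R; this sketch assumes the {car} package and its leveneTest() function, and skips that step if {car} is not installed.

```r
library(dplyr)
library(ggplot2)  # provides the msleep data set

two_groups <- msleep %>%
  filter(vore %in% c("herbi", "carni")) %>%
  mutate(vore = factor(vore))

# Normality within each group (Shapiro-Wilk)
normality <- two_groups %>%
  group_by(vore) %>%
  summarize(shapiro_p = shapiro.test(sleep_total)$p.value)

# Equal variances (Levene), only if the car package is installed
if (requireNamespace("car", quietly = TRUE)) {
  levene <- car::leveneTest(sleep_total ~ vore, data = two_groups)
}

# Welch t test (the default) vs. Student's t test (assumes equal variances)
welch <- t.test(sleep_total ~ vore, data = two_groups)
student <- t.test(sleep_total ~ vore, data = two_groups, var.equal = TRUE)
```

Independence cannot be tested with code. You have to think about how the data were collected.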

11.6 Getting results out of t.test

  • Use the tidy function from the broom package
  • Do carnivores have bigger brains than insectivores?
library(tidyverse)
library(medicaldata)
library(broom)

df %>% 
  filter(vore %in% c("carni", "insecti")) %>% 
t.test(formula = brainwt ~ vore, data = .) %>% 
  tidy() ->
result
result

11.6.1 Getting results out of t.test

## # A tibble: 1 x 10
##   estimate estimate1 estimate2 statistic p.value parameter
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>
## 1   0.0577    0.0793    0.0216      1.20   0.253        12
## # … with 4 more variables: conf.low <dbl>, conf.high <dbl>,
## #   method <chr>, alternative <chr>

11.7 Reporting the results from t.test using inline code

  • use backticks before and after, start with r
    • i.e. My result is [backtick]r code here[backtick].
  • The mean brain weight for carnivores was 0.0792556
  • The mean brain weight for insectivores was 0.02155
    • The difference was 0.0577056
  • The t statistic for this Two Sample t-test was 1.1995501
  • The p value was 0.2534631
    • The confidence interval was from -0.05 to 0.16
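Here is a sketch of how the pieces of a sentence like these can be pulled from the tidied result. In an R Markdown document you would put each expression (for example, round(result$estimate1, 3)) between a backtick-r and a closing backtick inside your text. The sprintf() call below just builds the same kind of sentence in plain R so you can see what it produces; the wording of the sentence is only an example.

```r
library(dplyr)
library(ggplot2)  # provides the msleep data set
library(broom)

result <- msleep %>%
  filter(vore %in% c("carni", "insecti")) %>%
  t.test(brainwt ~ vore, data = .) %>%
  tidy()

# Build a reporting sentence from the stored results, with rounding
sentence <- sprintf(
  "The mean brain weight was %.3f for carnivores and %.3f for insectivores (p = %.2f).",
  result$estimate1, result$estimate2, result$p.value
)
sentence
```

Rounding inside the inline code keeps your text readable, and the numbers update automatically if the data change.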

11.7.1 For Next Time

  • Skewness and Kurtosis
  • Review Normality
    • When to use Wilcoxon
  • Levene test for equal variances
    • When to use Welch t vs. Student’s t
  • Paired t and Wilcoxon tests

12 Running R from the UNIX Command Line

12.1 What is the UNIX Command line?

The command line is a simple Terminal window with a prompt at which you can type commands, and do primitive but powerful things to your files. The UNIX computing environment was developed beginning in the late 1960s, and is still beloved and fetishized by brogrammers, who believe you are not truly a programmer if you can’t code from the command line. This is silly.

The major attraction to UNIX in the 1960s is that it was much better than punch cards. Which isn’t saying much. We have had 60 years of software advancement and user interface improvements, so we really should not have to put up with the inherent user hostility of the UNIX environment.

UNIX is an early operating system, which is built around a ‘kernel’ which executes operating system commands, and a ‘shell’ which interprets your commands and sends them to the kernel for execution. The most common shell these days is named ‘bash’ (the “Bourne-again shell”), which is a silly brogrammer pun on the name of the older Bourne shell. You will sometimes see references to shell scripts or shell or bash programming. These are the same thing as command line programming.

UNIX is a common under-the-hood operating system across many computers today, as Apple’s macOS and iOS are built on top of UNIX, and the various versions of Linux are built on a UNIX-like kernel, with a similar command shell.

The command line is often the least common denominator between different pieces of open-source software that were not designed to work together. It can occasionally be helpful to build a data pipeline from mismatched parts. However, there is a lot of low-quality user-hostile command line work involved to get it done, often referred to as “command-line bullshittery”. This is a common bottleneck that slows scientific productivity, and there is a vigorous discussion of it on the interwebs here and here (counterpoint). Essentially, some argue that it is largely a waste of time and effort, while others see it as a valuable learning experience, like doing least squares regression by hand with a pencil.

Running R from the command line is a bit like spending a day tuning your car’s engine by yourself. There is a case to be made that this will improve the efficiency and performance of your car, but it is also usually more efficient to pay someone else to do it, unless you are a car expert with a lot of free time.

12.2 Why run R from the command line?

You can run R from the command line. It has none of the bells and whistles, nor any of the user conveniences of the RStudio Integrated Development Environment (IDE). But it is how R was originally expected to be used when it was developed in New Zealand in the 1990s (version 1.0.0 was released in 2000).

Running R from the command line allows you to do powerful things, like process multiple files at once, which can be handy when you have multiple files of sequencing data from distinct observations, or you have a multistep data wrangling pipeline with several slow steps. For many years, this was the only way to easily apply code across multiple files to build a complex data pipeline.

This is much less true today, with tools to handle file paths like the {here} and {fs} packages, run Python scripts from R with the {reticulate} package, run C++ code with the {Rcpp} package, and run bash, Python, SQL, D3, and Stan code chunks from R Markdown. You can use the {drake} package (or its successor, {targets}) to manage multi-step data pipelines in different languages (similar to make). But some labs have been doing things at the command line for years, and find it hard to change.

12.3 How do you get started?

First, you need to open a terminal window. And to do that, you need to find it. This is akin to getting under the hood of a car, and computer makers don’t exactly encourage it.

12.3.1 On a Mac

  • Go to Finder/Applications/Utilities/Terminal

12.3.2 On a Windows PC

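  • Open the Start menu, search for Terminal, and open it. Note that the default Windows shells (PowerShell and Command Prompt) are not UNIX shells, so to follow the bash commands in this chapter you will likely need a bash shell, such as Git Bash or the Windows Subsystem for Linux (WSL).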

12.4 The Yawning Blackness of the Terminal Window

So, you have managed to open a terminal window, which has a standard UNIX prompt, ending in something like % or $. Not terribly helpful, is it? The bash shell is waiting for you to enter a command.
No user interface for you!

Let’s start with a simple one, which can’t do any harm. Run the command below:

whoami
## peterhiggins

Remember that UNIX started out as an operating system for terminals, and knowing who was logged in was a helpful thing.

You can string together two commands with a semicolon between them.

Try the following:

whoami;date
## peterhiggins
## Thu Dec 31 18:17:55 EST 2020
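One thing worth knowing about the semicolon is that it does not care whether the first command worked. A minimal sketch of the difference between ; and && is below. The folder name /no/such/folder_xyz is made up and assumed not to exist on your computer, and the variable names are only for illustration.

```shell
# With a semicolon, the second command runs even though the first one fails.
semicolon_result=$(ls /no/such/folder_xyz 2>/dev/null; echo "second command ran")

# With &&, the second command is skipped because the first one fails.
# The trailing "|| true" only keeps this demo from stopping a strict script.
and_result=$(ls /no/such/folder_xyz 2>/dev/null && echo "second command ran") || true

echo "with ;  -> '$semicolon_result'"
echo "with && -> '$and_result'"
```

Using && is a reasonable habit when the second command only makes sense if the first one succeeded, such as changing to a directory and then removing files in it.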

OK, fine. This is sort of helpful. It was really important when you were on a terminal and paying by the minute for time on a mainframe back in 1969. And, on occasion, if you need to use an entire computer cluster to run a script (or scripts) on a lot of data, you will likely have to use some of this command line knowledge. You can even schedule jobs (scripts) to run when your time is scheduled on the cluster with cron and crontab.

At this point, it would be helpful to open a window with your Documents folder, and keep it side by side with the window in which you are reading this e-book. We will start working with files and directories, and it is helpful to see changes in your file/folder structure in real time. As we run commands in the bash shell, check them against what you see in the folder window. You may find that some files (dotfiles, starting with a period) are hidden from the user to prevent problems that occur when these are deleted.

12.5 Where Are We?

OK, let’s start looking at files and directories. Start with the pwd command, which does not stand for password, but for print working directory.

Run the code below in your Terminal window.

pwd
## /Users/peterhiggins/Documents/RCode/rmrwr-book

You can see the full path to your current directory. This can be a bit obscure if you are just looking at your folder structure, particularly at the beginning of the path. Fortunately, the {here} package handles a lot of this for you when you are working in RStudio projects.
We think of the directory structure as a tree, with a root (written as a single /) and various branches as you build out folders and subfolders. In this case, Users is the first branch off the root.
We can move up and down the folders of the directory paths with the cd command, for change directory.

Try this command in your Terminal Window, and see if you can figure out what it does.

cd ..

It changes the directory up one level closer to the root directory. It is straightforward to go up the directory tree, as each folder only has one parent. But it is tricky to go down the directory tree, as there are many possible branches/children, and you do not inherently know the names of these branches. We need to list the contents of your current directory with ls to know what is there.

Try the ls command in your Terminal window

cd /Users/peterhiggins/Documents/;
ls
## 1FQ_Crohn's Disease_23Oct2020 (002).doc
## 2020-Jun-05 AGA IMIBD meeting notest.docx
## 2021 AGA Invited Speaker Session Basic Hybrid Example.pdf
## 2021.Higgins AGA Distinguished Clinician.CO.docx
## A is for Allspice.2.0.docx
## A is for Allspice.docx
## ABT263_HIO_report_toWord.docx
## AGA IMIBD
## AGA IMIBD Councilor Career Discussion Guide.docx
## AGA IMIBD Webinar Outline.docx
## AIBD CAM Higgins.pdf
## AIBD CAM Higgins.pptx
## AIBD SoMe Higgins.pdf
## AIBD SoMe Higgins.pptx
## AIBD agreement.docx
## AIBD20Template.pptx
## AMAG DDW Clear draft_PDRH comments.docx
## APG1244_Milestone_report.docx
## ASUC_UC_protocol_comments_2020.docx
## A_Woodward_Score Sheet_PDRH.docx
## Accounts and Access (1) (1).docx
## Advice for participants in webinars.docx
## Animation of NSAID.pptx
## BKochar_Frailty.pdf
## BM recommendation.docx
## Beginners_GuideToR.pdf
## Biosketch for K.pptx
## Biosketch_2020_Higgins_ClinResIBD_biosketch.doc
## Brazil.ItineraryNov2015.docx
## Butter BCS Chicken.docx
## CAS.K.candidate.background_SB_PDRH.docx
## CAS.T32.Project.Description-JS.docx
## CAS.career.goals.obj.development.training_PDRH.docx
## CC360_The Risk of SARS.R1.docx
## CC360_The Risk of SARS.docx
## CCF IBD Webcast 2020 Draft Deck_For Review.pptx
## CCFA EIC Candidate Interview Questions (candidates) jobin[1].doc
## CDC_proposal1.1.docx
## CLARE STOCKS.docx
## COVID Trials Feasibility
## CaltechCampus Tour & Information Session.webarchive
## Cancel Appt Epic.ppt
## Causal.png
## CellDeath_DDW_2021_ISS.pdf
## Chu RPG Review_PDRH.docx
## Clare Investment Summary.docx
## Council Conversations Author Chat Guide.docx
## Coursera_Programming in R Notes.docx
## CoverLetterPlus.pptx
## Crash&Burn_ScriptV2_100318 copy.pdf
## DataCamp Courses by Topic.docx
## DeEscalationACG2016.pptx
## Demographics.pdf
## Documents.Rproj
## DrHiggins IBD Data Request.xlsx
## Draft Postop IBD Surgery Care Protocols v2_SERedit.docx
## ECCO 2016 Amsterdam Schedule.docx
## ECCO 2019 UC PRO SS Abstract D1f_JP_UA_YO_AM_PDRH.docx
## ECCO2016Lycera30937.pptx
## Effect of medications on the recurrence of cancer in IBD patients.docx
## Electrical engineering interview questions.docx
## FDAtofaResponse.docx
## FFMI Kickstart-FinalReport 5-20-16-LJ.docx
## FITBITProtocol_28NOV2016_AbbVie.docx
## FITBITProtocol_4DEC2016_AbbVie.docx
## FMT_DDW_2021_ISS.pdf
## FibrosisIBDCedars2016.pptx
## Figures-KC-JAMA.pptx
## Finance and Retirement Plans.docx
## Financial Priorities.docx
## Garmin Notes.docx
## General Social Media Tips.docx
## General thoughts about query letters.docx
## Getting Started with REDCap.docx
## Git for MDs_2.pptx
## GitHub
## Github for MDs_1.pptx
## Glover_RPG_Review_PDRH.docx
## GoToMeeting Chats
## GradPartyHigginsInvites.xlsx
## HPI-5016 IBD Patient Contact Info.xlsx
## HS movie.docx
## Higgins AGA Webinar Slides.pptx
## Higgins Bio.docx
## Higgins New IBD.pptx
## Higgins Refractory Proctitis.pptx
## Higgins biosketch2015KRao.doc
## Higgins biosketch2016KRao.doc
## Higgins-peter.jpg
## HigginsACGMidwest2019_PerioperativeIBD.pptx
## Higgins_LOS_IBDBiobank_Shah_Nusrat_2019.docx
## Higgins_UM_CME_Pregnancy in IBD.pptx
## How To Log in to IBD Server.docx
## How To Log in to RStudio Server for HigginsLab.docx
## How To Log in to RStudio Server for Shiny.docx
## IBD 2020 - Honorarium reimbursement Form.docx
## IBD Biobank Cryostor.pptx
## IBD Clinical Trials for MDsDearborn2017.pptx
## IBD Insurance Pilot Results.docx
## IBD Insurance Survey for CCFA Partners Existing.docx
## IBD Journal Club 13Feb2017.docx
## IBD Journal Club July 11.docx
## IBD Plexus meeting 21 Sep 2015 notes.docx
## IBD School 322 Script.docx
## IBD School 324 Script.docx
## IBD School 325 Script.docx
## IBD and biologics tweets.docx
## IBD inbox coverage.docx
## IBDInsuranceSurvey3.docx
## IBDMentoringConferenceCall4AbstractsPH.docx
## IBD_Deescalation_Apr_2019_PDRH.docx
## IBDforLansing2017.pptx
## IMG_0006.jpg
## IMG_0008.jpg
## IMG_1523st.jpg
## IMIBD Councilors 2020-21.docx
## IMIBD Partners insurance 2020DDW.pptx
## IMIBD_expanded_descriptors.xlsx
## Introduction to Application Supplement Photoacoustic.docx
## JAK_DDW_2021_ISS.pdf
## JAMA_KC_Second JAMA.docx
## JAMA_Review_on_CD_Revisions_Tracked_Changes with edits_PDRH.docx
## JB_V1 Career Goals and Objectives 7.8.2020_PDRH.docx
## JB_V2 Candidate’s Background 7.7.2020_PDRH.docx
## JDix_Study_update.docx
## K Award Institutional Letter of Commitment.pptx
## K Candidate Section.pptx
## K105_Melmed_PROs in Practice_MB_bb_JLS.pptx
## K23 Aims - Shirley Cohen-Mekelburg 11.14.19.docx
## K23_morph_measurements_MockupManuscript_21JAN2019.docx
## Learning R discussion Jeremy Louissaint.docx
## Letter to Frank Hamilton.docx
## Lin_Reviewer Score_PDRH.docx
## Log in to IBD Server.docx
## MEI_2020_PH_W9.pdf
## MEI_ACH_Wire Transfer Form.docx
## MIM-TESRIC PROTOCOL_Higgins_14Apr2020.docx
## MIM-TESRIC PROTOCOL_Higgins_26Aug2020.docx
## Managment of CD.pptx
## Manuscript v1.docx
## Manuscript v2.PDRH.docx
## McDonald, Nancy.pdf
## Megan McLeod Rec Letter Residency.docx
## MentoringAgendaDraftPH.docx
## Meta analysis TB vs CD version 3.5.docx
## Michigan Medicine Gastroenterology Social Media Initiative.docx
## Michigan Medicine Model for COVID-19 Clinical Trial Oversight DRAFT (KSB 04.17.20)-AL-PDRH.docx
## Microsoft User Data
## MultidisciplinaryIBDClinicPHv2.docx
## NordicTrackTC9iTreadmillManual.pdf
## Oct2019payPDRH.PDF
## Odd college lists.docx
## P Singh K grant aims 8-25_PDRH.docx
## P2PEP slide 2020
## P2PEP slide 2020.pptx
## PHcv2019.docx
## PHcv2020.docx
## PRO agenda videos VINDICO.docx
## PRO letter.docx
## PS_K grant aims 6-25_PDRH.docx
## PTM LOS From PDRH.docx
## PTM LOS From PDRH.pdf
## Pearson 5 Notes.docx
## Perils of Excel.pptx
## Personal statement version 3!.docx
## Pitch Letter - S is for Saffron.docx
## Poppy Eulogy backup.docx
## Poppy Eulogy.docx
## Possible Eastern College Tour.docx
## Powerpoint
## Prashant Rec Letter.docx
## Prashant Rec Letter.pdf
## PredictingIBD_DDW_2021_ISS.html
## PredictingIBD_DDW_2021_ISS.pdf
## Purdue Disclosure Form_Higgins.docx
## Question 16.docx
## RCode
## Ramp up clinical research_PH.xlsx
## Ramping up human subject research - MM 6-1-20 _KDA_PDRH_suggestions.docx
## Recordings
## Review Criteria for COVID Clinical Trials.docx
## Review guidelines_2017.docx
## Roasted Salted Cashews.docx
## S is for Saffron 3.0.docx
## S is for Saffron 3.1.docx
## S is for Saffron 3.2.docx
## S is for Saffron.2.0.docx
## SEAN STOCKS.docx
## SIG_Template_IBD Program_FINAL.docx
## Sean Common App academic honors list.docx
## Sean Common App activities list.docx
## Sean Higgins Bordogni.mp4
## Sean Higgins Brag Sheet.docx
## Sean Investment Summary.docx
## Sean Resume Tabular VBorder.docx
## Sean Resume Tabular.docx
## Sean Resume.docx
## Sean Summer Priorities 2016.docx
## SecureIBD.pptx
## ShareRmd.html
## Sherman Prize Nominee Questions.docx
## Shoreline West Tour Information.docx
## Short PA slides.pptx
## Shotwave thread.docx
## Signing Clinical Research Infusion Orders.pdf
## SingleCell_DDW_2021_ISS.pdf
## SoMe_use_2020.png
## Social Media for GI.pptx
## Source Code PT1.docx
## Stelara paper.docx
## T32_current_text_14June2019.docx
## TOPPIC ML draft v5SCM_YL_AKW_PDRH.docx
## TabaCrohn IBD J club.docx
## Tables.docx
## Takeda_IBD School Videos_Submission.pdf
## Task List 2020-2.docx
## Task List 2020-5.docx
## Task List 2020.docx
## Testing signatures with Adobe.pdf
## The Risk of SARS.R1.Markup.docx
## Tidymodels.docx
## Tofa in ICI Figure Legends_Final Draft_V2.docx
## Tofa inpatient induction Protocol_02NOV2018_PHforEdits.docx
## Toffee Separation Tips.docx
## UCRx_DDW_2021_ISS.pdf
## UC_protocol_comments_2020.docx
## UM IBD Clinical Trials IBD referral form.docx
## UPA_U_ACHIEVE 1st draft_PDRH.docx
## VINDICO_PRO.pptx
## VideoVisitSchedulingQuickApptsforProviders.pdf
## VincentChen_K specific aims 2020-10-25.docx
## VirtualPtEdMar2020.v2.pdf
## WebEx
## Zoom
## Zwift
## Zwift-Gift-Card.pdf
## aga institute council july 2020 meeting.pdf
## algorithms_thiopurine.pdf
## base-r-cheatsheet.pdf
## biomakers_fibrosisPDRH.docx
## bmj_imputation.pdf
## cgh_factors_utilization.pdf
## cycling core exercises.docx
## draft_tokenization letter Risa_Uste.docx
## early-career-faculty_Dec-2020.xlsx
## epic cancel_reschedule appointments.ppt
## epic schedule viewing_close.ppt
## escalator.html
## fellow graduation 2020.docx
## hexStickers.jpg
## higgins2x3.jpg
## iBike Rides
## learnr app diagram.jpg
## learnr app diagram.pptx
## letter Lowrimore.docx
## mockstudy manuscript draft.docx
## nejm1966_beecher_ethics.pdf
## nejm_indomethacin.pdf
## nejm_statins.pdf
## pdrh_IBD_email.xlsx
## personal statement fellowship_PDRH.docx
## peterhiggins.jpg
## seq-6.pdf
## signature.docx
## signature.fld
## signature.html
## signature.pdf
## signature.png
## stiff_bcl.R
## submitJanssen_IBD School Videos_12Jul2018.pdf
## tidyr_pivot.png
## tidyr_pivot.xcf
## ucla1.jpg
## untidy_sheets.pptx
## wga_min20.pdf
## ~$T Review Higgins.docx
## ~$sk List 2020-5.docx
## ~$sk List 2020.docx

You will see a listing of all files and folders in the current directory. You can get more details by adding the option (sometimes called a flag) -l, as shown below.

cd /Users/peterhiggins/Documents/;
ls -l

The full listing will give you more details, including read & write permissions, file size, date last saved, etc.
Many commands have options, or flags, that modify what they do.
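As one more example of a flag, here is a minimal sketch of ls -a, which shows the hidden dotfiles mentioned earlier. It works in a scratch folder made with mktemp -d so that nothing in your real folders is changed, and it uses the touch command (covered later in this chapter) to make two empty files. The file and variable names are made up for illustration.

```shell
# Make a scratch folder so that nothing in your real folders is touched.
demo_dir=$(mktemp -d)
touch "$demo_dir/visible.txt" "$demo_dir/.hidden_file"

plain_listing=$(ls "$demo_dir")      # dotfiles are left out
full_listing=$(ls -a "$demo_dir")    # -a (all) includes dotfiles, plus . and ..

echo "ls shows:"
echo "$plain_listing"
echo "ls -a shows:"
echo "$full_listing"

# Flags can be combined: -l (long) and -a (all) together
ls -la "$demo_dir"
```

You can read about the flags for most commands with man, as in man ls (press q to quit the manual page).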

Find a folder inside of your Documents folder. We will now go down a level in the directory tree. In my case, I will use the Powerpoint folder.
In your Terminal window:

  • change the directory to the Powerpoint directory
  • list the contents of this folder
cd /Users/peterhiggins/Documents/Powerpoint;
ls
## 2016IBDClinTrialsforMDsDearborn.pptx
## 2016IntegratedDeckorMDsGB.pptx
## 2019 SCSG GI Symposium IBD SoA  -  Read-Only.pptx
## BE LGD Dearborn 2016.04.12.pptx
## Getting Started in RStudio.pptx
## Higgins Microbiota for IBD Patient Ed.pptx
## HigginsDec2018AJG_SmokingStatus.pptx
## IBDUpdate.pptx
## Integrated Slide Deck Dearborn 2016.04.12.pptx
## MER Stress Management Dearborn 4-14.pptx
## MichiganMedicine-IBDTemplate.potx
## PDRH RCAR 2020.pptx
## PennThioMTX2017Higgins.pptx
## Pregnancy in IBD.pptx
## Regenbogen CRS for GI CME Course2016.pptx
## Senior Slide Show.pptx
## ThomsonRectalStumpComplicationsIBD2_13.pptx
## UEGweek2020.pptx
## UMHS Talk- Moving Beyond AntiTNF 4-2016 FINAL v2.pptx
## Vertebrate Animals for K.pptx
## VirtualPtEdMar2020.v2.pptx
## Writers Room.pptx
## ibd_meds_surgery_metan.pptx

Great!
You moved to a new directory and listed it.
Now we will get fancy, and make a new directory within this directory with the mkdir command.

Try this in your Terminal window:

pwd;
mkdir new_files;
ls

You have now made a new directory (folder) within the previous directory, named new_files. Verify this in your Documents folder.
You can now change to this directory
and list the contents (it should be empty).

Try this out in your Terminal window (note: edit the cd command to use your own directory path).

cd /Users/peterhiggins/Documents/Powerpoint/new_files;
ls

Note that you can abbreviate the current directory with ., so that you could have also used cd ./new_files
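Here is a minimal sketch of moving down and back up with relative paths. It assumes a scratch folder made with mktemp -d, so that your own Documents folder is left alone; the variable names are only for illustration.

```shell
demo_root=$(mktemp -d)   # scratch folder, so your own files are not touched
cd "$demo_root"
mkdir new_files

cd ./new_files           # relative path; plain "cd new_files" does the same
inside=$(pwd)

cd ..                    # .. is the parent of the current directory
back=$(pwd)

echo "inside: $inside"
echo "back:   $back"
```

Relative paths like these keep working when a project folder is moved to another computer, while a full path starting with /Users/peterhiggins will not.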

You can create a new (empty) file in this directory with the touch command. Sometimes you need to create a new file, then write data to it.

Try this out

touch file_name;
ls

You can also create a file with data inside it with the cat > command.

Type in the following lines into your Terminal window. When complete, type control-D to be done and return to the Terminal prompt. cat stands for concatenate.

cat > file2.txt
cat1
cat2
cat3

Now you can list the contents of this file with the cat command below.

Give this a try

cat file2.txt
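Typing lines after cat > works when you are at the keyboard, but in a script you would more often write to a file with redirection. A minimal sketch, assuming a scratch folder made with mktemp -d: the > operator creates (or overwrites) a file, and >> appends to it.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

echo "cat1" > file2.txt     # > creates the file (or overwrites an existing one)
echo "cat2" >> file2.txt    # >> appends a line to the end
echo "cat3" >> file2.txt

cat file2.txt               # print the whole file
line_count=$(wc -l < file2.txt | tr -d ' ')
echo "lines in file2.txt: $line_count"
```

Be careful with a single >, as it will silently replace the contents of an existing file.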

You can also list the directory of your new_files folder with ls to see the new folder contents.

Try this

ls

Note that you don’t need to use the Terminal to run bash commands. You can do this from an Rmarkdown file.
Take a moment to run pwd in your Terminal, to get the current directory.

Now open RStudio, and a new Rmarkdown document.
Copy the path to the current directory from the Terminal.
Switch back to the Rmarkdown document.
Select one of the R code chunks (note the {r} at the top) and delete it.
Now click on the Insert dropdown at the top of the document, and insert a Bash chunk.
Now add UNIX commands (separated by a semicolon), like

cd (paste in path here);
pwd;
ls;
cat file2.txt

Then run this chunk.

Now you can run terminal commands directly from Rmarkdown!

12.6 Cleaning Up

OK, now we are done with the files file2.txt and file_name, and the directory new_files. Let’s get rid of them with rm (for removing files) and rmdir (for removing directories). Note that rmdir only removes empty directories, so the files have to go first.
In order, we will

  • make sure we are in the right directory (new_files) with pwd
  • remove both files with rm file2.txt file_name
  • go up one level of the directory with cd ..
  • remove the directory with rmdir new_files

Give this a try

pwd;
rm file2.txt file_name;
cd ..;
rmdir new_files

Verify all of this in your Documents window.
This is great. But you can imagine a situation in which you mistakenly rm a file (or directory) that you actually needed. Unlike your usual user interface, when a file is removed at the command line, it is gone. It is not in the trash folder. It is gone. There is something to be said for modern user interfaces, which are built for humans, who occasionally make mistakes. Sometimes we do want files or folders back.
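One way to give yourself a second chance is to move files into a holding folder, look at what is in it, and only then remove it. This is just a sketch of one possible habit, not a standard tool: the folder name to_delete and the file names are made up, and everything happens in a scratch folder from mktemp -d. Another option is rm -i, which asks for confirmation before each removal.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch keep_me.txt scratch.txt

mkdir to_delete              # hypothetical holding folder
mv scratch.txt to_delete/    # step 1: move the file instead of removing it
ls to_delete                 # step 2: look at what is about to go
rm -r to_delete              # step 3: remove the folder and its contents

remaining=$(ls)
echo "still here: $remaining"
```

The -r flag tells rm to remove a directory and everything inside it, so it deserves extra care.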

12.7 Other helpful file commands

Here are some file commands worth knowing

  • cat filename - to print out whole file to your monitor
  • less filename - to print out the first page of a file, and you can scroll through each page one at a time
  • head filename - print first 10 lines of a file
  • tail filename - print last 10 lines of a file
  • cp file1 file2 - copy file1 to file2
  • mv file1.txt file2.txt file3.txt new_folder - move 3 files to a new folder
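The commands in the list above can be tried safely with a small sketch like this one. It assumes a scratch folder made with mktemp -d and uses seq to make a practice file with the numbers 1 to 25, one per line. The file and variable names are made up for illustration.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
seq 1 25 > numbers.txt                   # practice file with 25 lines

first_lines=$(head -n 3 numbers.txt)     # -n sets how many lines to print
last_line=$(tail -n 1 numbers.txt)
echo "first three lines:"
echo "$first_lines"
echo "last line: $last_line"

cp numbers.txt numbers_copy.txt          # copy the file
mkdir new_folder
mv numbers.txt numbers_copy.txt new_folder   # move both files into the folder
ls new_folder
```

Note that mv is also how you rename a file: mv old_name.txt new_name.txt.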

12.8 What about R?

So now you can get around directories, and find your files in the Terminal window, but you really want to run R.
You can launch an R session from the Terminal Window (if you have R installed on your computer) by typing the letter R at the Terminal prompt

Launch R

R

You get the usual R intro, including version number, and the > prompt.
Now you can run R in interactive mode with available datasets, or your own datasets.

Try a few simple commands with the mtcars dataset.
Give the examples below a try.

You can use q() to quit back to the terminal (and reply “n” to not save the workspace image).

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0
##                   gear carb
## Mazda RX4            4    4
## Mazda RX4 Wag        4    4
## Datsun 710           4    1
## Hornet 4 Drive       3    1
## Hornet Sportabout    3    2
## Valiant              3    1
library(dplyr) # provides %>%, filter(), and select()
mtcars %>%
  filter(mpg > 25) %>% 
  select(starts_with('m')|starts_with('c'))
##                 mpg cyl carb
## Fiat 128       32.4   4    1
## Honda Civic    30.4   4    2
## Toyota Corolla 33.9   4    1
## Fiat X1-9      27.3   4    1
## Porsche 914-2  26.0   4    2
## Lotus Europa   30.4   4    2

12.9 What about just a few lines of R?

Sometimes you will want to call R, run some code, and be done with R.
You can call R, run a few lines, and quit in one go.
Just add the flag -e (for evaluate) to the call to R,
and put the R commands in quotes.

Try the example below (note that this will not work if you are still in R - be sure you are back in the terminal with the % or $ prompt)

R -e "head(mtcars)"

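or this example, which installs a package. Note that it does not matter whether you put single or double quotes around the R code, as long as they match, but any quotes inside the R code must be of the other type. Because R is not running interactively here, you also need to tell it which CRAN mirror to use with the repos argument.

Try this

R -e 'install.packages("palmerpenguins", repos = "https://cloud.r-project.org")'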

You can also string together several commands with the semicolon between them.

Try the example below.

R -e 'library(palmerpenguins);data(penguins);tail(penguins)'

12.10 Running an R Script from the Terminal

Now we are stepping up a level - you have an R script that you have carefully created and saved as the myscript.R file. How do you run this from the Terminal?
This is easy - just call the Rscript command with your file name.

Pick out a short R file you have written, make sure you are in the right directory where the file is, and use it as in the example below.

Rscript myscript.R

This launches R, runs your script, saves resulting output (if your script includes save or ggsave commands), closes R, and sends you back to the Terminal. Very simple.
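This is where the command line starts to pay off, as you can run the same script on many files with a loop. The sketch below is only an illustration under several assumptions: the script myscript.R, the data files sample_a.csv and sample_b.csv, and the scratch folder are all made up, and the script does nothing more than print the file name it was given. It checks for Rscript first, so it will still run (and just report what it would have done) on a computer without R.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

# A hypothetical R script that reads a file name from the command line
cat > myscript.R <<'EOF'
args <- commandArgs(trailingOnly = TRUE)
cat("processing", args[1], "\n")
EOF

touch sample_a.csv sample_b.csv    # empty stand-ins for real data files

processed=0
for data_file in sample_*.csv; do
  if command -v Rscript >/dev/null 2>&1; then
    Rscript myscript.R "$data_file"
  else
    echo "Rscript not found; would run: Rscript myscript.R $data_file"
  fi
  processed=$((processed + 1))
done
echo "files visited: $processed"
```

Inside the R script, commandArgs(trailingOnly = TRUE) returns whatever you typed after the script name, which is how the file name gets from the shell into R.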

12.11 Rendering an Rmarkdown file from the Terminal

This is a little different, as you can’t just run an Rmarkdown file. Normally you would use the dropdown button to knit your file from RStudio. But you can use the rmarkdown::render command to render your files to HTML, PDF, Word, Powerpoint, etc. Pick out a simple Rmd file like output_file.Rmd below, make sure you are in the right directory where the file is, and try something like the example below.
Note that this is one case where nesting different types of quotes (single vs. double) can come in handy.
It helps to use single quotes around your filename and double quotes around the rmarkdown::render command.

Try it out

Rscript -e "rmarkdown::render('output_file.Rmd')"
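If you want a particular output type, you can pass the output_format argument to rmarkdown::render. Here is a minimal sketch that renders to Word. It assumes that R, the {rmarkdown} package, and pandoc are installed; if they are not, it prints a message instead of stopping. The tiny output_file.Rmd it creates is made up for practice, in a scratch folder from mktemp -d.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

# A tiny, hypothetical R Markdown file to practice on
cat > output_file.Rmd <<'EOF'
---
title: "Command line demo"
---

Hello from the Terminal.
EOF

if command -v Rscript >/dev/null 2>&1; then
  # Single quotes inside the R code, double quotes around it
  Rscript -e "rmarkdown::render('output_file.Rmd', output_format = 'word_document')" \
    || echo "Render failed: check that {rmarkdown} and pandoc are installed."
else
  echo "Rscript was not found on this computer."
fi
```

Other values for output_format include 'html_document' and 'pdf_document' (the latter also needs a LaTeX installation).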

So there you have it!
Just enough to get you started with R from the command line.

13 Building and Publishing a {bookdown} book on bookdown.org

This book is published on bookdown.org, where you can create an account to publish your own e-book and share it with the world.

Once you have an account, follow the steps below.

13.1 Setting up

Install the {bookdown} package, with install.packages('bookdown').

Then run library(bookdown) in the Console to load the package.

Then, in the RStudio IDE, choose File/New Project/Book Project using bookdown.

Then go to the Files tab, open index.Rmd, and click the Knit button. The Preview Window will show you a minimal example of a bookdown book. You can start editing and adding chapters.

13.2 Bookdown YAML

You can edit your _bookdown.yml file, which controls the setup of your book. My _bookdown.yml file looks like this:

book_filename: "rmrwr"
title: "Reproducible Medical Research with R"
language:
  ui:
    chapter_name: "Chapter "
delete_merged_file: true
new_session: yes
rmd_files:
- index.Rmd
- io02-getting-started.Rmd
- io03-tasting.Rmd
- io65-error_messages.Rmd
- io04-updating.Rmd
- io07-major-updates.Rmd
- io08-data-validation.Rmd
- io09-timeseries.Rmd
- io10-tableOne.Rmd
- io30-ttest.Rmd
- io70-r_cmd_line.Rmd
- io98-title-holder.Rmd
- io99-references.Rmd

13.3 Output YAML

You can edit your _output.yml file, which controls the output and look of your book. My _output.yml file looks like this:

bookdown::gitbook:
  css: style.css
  config:
    toc:
      before: |
        <li><a href="./">RMRWR</a></li>
      after: |
        <li><a href="https://github.com/rstudio/bookdown" target="blank">Published with bookdown</a></li>
    edit: https://github.com/rstudio/bookdown-demo/edit/master/%s
    download: ["pdf", "epub"]
bookdown::pdf_book:
  includes:
    in_header: preamble.tex
  latex_engine: xelatex
  citation_package: natbib
  keep_tex: yes
bookdown::epub_book: default

Note that this refers to a style.css file, which affects the appearance of your book.

13.4 style.css

My style.css file looks like this:

@import url('https://fonts.googleapis.com/css?family=Abril+Fatface|Source+Sans+Pro:400,400i,700,700i|Lora:400,400i,700,700i&display=swap');

p.caption {
  color: #777;
  margin-top: 10px;
}
p code {
  white-space: inherit;
}
pre {
  word-break: normal;
  word-wrap: normal;
}
pre code {
  white-space: inherit;
}

/*  Desiree custom css  */

/* next 3 rules for setting large image at top of each page and pushing book content to appear beneath that */
/*
.hero-image-container {
  position: absolute;
  top: 0;
  left: 0;
  right: 0;
  height: 390px;
  /*background-image: url("images/books.jpg");
  background-color: #2F65A7;
}*/

/*.hero-image {
  width: 100%;
  height: 390px;
  object-fit: cover;
}*/

/*.page-inner {
  padding-top: 440px !important;
}*/

/* Links */

.book .book-body .page-wrapper .page-inner section.normal a {
  color: #702082;
}


/* Body and header text */

.book.font-family-1 {
  font-family: 'Source Sans Pro', arial, sans-serif;
}

h1, h2, h3, h4 {
  font-family: 'Lora', arial, sans-serif;
}


.book .book-body .page-wrapper .page-inner section.normal h1,
.book .book-body .page-wrapper .page-inner section.normal h2,
.book .book-body .page-wrapper .page-inner section.normal h3,
.book .book-body .page-wrapper .page-inner section.normal h4,
.book .book-body .page-wrapper .page-inner section.normal h5,
.book .book-body .page-wrapper .page-inner section.normal h6 {
    margin-top: 1em;
    margin-bottom: 1em;
}

.title {
  font-family: 'Lora';
  font-size: 3em !important;
  color: #2f65a7;
  margin-top: 0.275em !important;
  margin-bottom: 0.35em !important;
}

.subtitle {
  font-family: 'Lora';
  color: #2f65a7;
}


/* DROP CAPS*/


/*p:nth-child(2):first-letter {   /* /* DROP-CAP FOR FIRST P BENEATH EACH H1 OR H2*/ /*
  color: #2f65a7;
  float: left;
  font-family: 'Abril Fatface', serif;
  font-size: 7em;
  line-height: 65px;
  padding-top: 4px;
  padding-right: 8px;
  padding-left: 3px;
  margin-bottom: 9px;
}
*/

/* try the below with the ~ instead...or just the space?) */

.section.level1 > p:first-of-type:first-letter { /*drop cap for first p beneath level 1 headers only within class .section*/
  color: #2f65a7;
  float: left;
  font-family: 'Abril Fatface', serif;
  font-size: 6em;
  line-height: 65px;
  padding-top: 4px;
  padding-right: 8px;
  padding-left: 3px;
  margin-bottom: 9px;
}

/* add drop cap to first paragraph that follows the first 2nd level header*/
/*
.section.level2:first-of-type > p:first-of-type:first-letter {
  color: #2f65a7;
  float: left;
  font-family: 'Abril Fatface', serif;
  font-size: 7em;
  line-height: 65px;
  padding-top: 4px;
  padding-right: 8px;
  padding-left: 3px;
  margin-bottom: 9px;
}
*/



/* TOC */

.book .book-summary {
  background: white;
  border-right: none;
}

.summary{
  font-family: 'Source Sans Pro', sans-serif;
}

/* all TOC list items, basically */
.book .book-summary ul.summary li a, .book .book-summary ul.summary li span {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 15px;
  padding-right: 15px;
  color: #00274c;
}

.summary a:hover {
  color: #ffcb05 !important;
}

.book .book-summary ul.summary li.active>a { /*active TOC links*/
  color: #d86018 !important;
  border-left: solid 4px;
  border-color: #d86018;
  padding-left: 11px !important;
}


li.appendix span, li.part span { /* for TOC part names */
  margin-top: 1em;
  color: #000000;
  opacity: .9 !important;
  text-transform: uppercase;
}

.part + li[data-level=""] { /* grabs first .chapter immediately after .part...but only those ch without numbers */
 text-transform: uppercase;
}



ul.summary > li > a { /* The > selects all the li's which are immediately within the class summary*/
  font-family: 'Source Sans Pro', sans-serif;
}

/* The next two rules make the horizontal line go straight across in top navbar */

.summary > li:first-child {
    height: 50px;
    padding-top: 10px;
    border-bottom: 1px solid rgba(0,0,0,.07);
}

.book .book-summary ul.summary li.divider {
    height: 0px;
}


/* source code copy button */
.copy {
  width: inherit;
  background-color: #e2e2e2 ;
  border: none;
  border-radius: 2px;
  float: right;
  font-size: 60%;
  padding: 4px 4px 4px 4px;
}

/* Two columns */

.col2 {
  columns: 2 200px;         /* number of columns and width in pixels*/
  -webkit-columns: 2 200px; /* chrome, safari */
  -moz-columns: 2 200px;    /* firefox */
}


.side-by-side {
  display: flex;
}

.side1 {
  width: 40%;
}

.side2 {
  width: 58%;
  margin-left: 1rem;

}

/* -------------- div tips-------------------*/

div.warning, div.tip, div.tryit, div.challenge, div.explore {
  border: 4px #dfedff; /* very light blue */
  border-style: solid;
  padding: 1em;
  margin: 1em 0;
  padding-left: 100px;
  background-size: 70px;
  background-repeat: no-repeat;
  background-position: 15px center;
  min-height: 120px;
  color: #00274c; /* blue text */
  background-color: #bed3ec; /* light blue background */
}

div.warning {
  background-image: url("images/warning.png");
  background-color: #f7f7f7; /* gray97 background */
}

div.tip {
  background-image: url("images/tip.png");
  background-color: #fff7bc; /* warm yellow background */
}

div.tryit {
  background-image: url("images/tryit.png");
  background-color: #edf8fb; /* light blue background */
}

div.challenge {
  background-image: url("images/challenge.png");
  color: #4b0082; /* indigo text */
  background-color: #ffe1ff; /* thistle background */
}

div.explore {
  background-image: url("images/explore.png");
  background-color: #d0faee; /* green card background */
}

/* .book .book-body .page-wrapper .page-inner section.normal is needed
   to override the styles produced by gitbook, which are ridiculously
   overspecified. Goal of the selectors is to ensure internal "margins"
   controlled only by padding of container */

.book .book-body .page-wrapper .page-inner section.normal div.rstudio-tip > :first-child,
.book .book-body .page-wrapper .page-inner section.normal div.tip > :first-child {
  margin-top: 0;
}

.book .book-body .page-wrapper .page-inner section.normal div.rstudio-tip > :last-child,
.book .book-body .page-wrapper .page-inner section.normal div.tip > :last-child {
  margin-bottom: 0;
}

iframe {
   -moz-transform-origin: top left;
   -webkit-transform-origin: top left;
   -o-transform-origin: top left;
   -ms-transform-origin: top left;
   transform-origin: top left;
}

.iframe-container {
  overflow: auto;
  -webkit-overflow-scrolling: touch;
  border: #ddd 2px solid;
  box-shadow: #888 0px 5px 8px;
  margin-bottom: 1em;
}

.iframe-container > iframe {
  border: none;
}

13.5 Creating Chapters in R Markdown

Each chapter was created in R Markdown, with R code chunks, flipbooks, and learnr apps as exercises.

Note that each chapter should start with a level 1 header, which will be the title of the chapter. Each level 1 header starts with a single hashtag, then a space, then the text of the title.

You can save draft chapters without necessarily publishing them to the final book. They will not be included until you list them in your _bookdown.yml file.

After saving and knitting each chapter successfully, the finalized chapters can be included in the book build, and ordered, by adding them to the _bookdown.yml file, in between index.Rmd, and io98-title-holder.Rmd.

13.6 Chapter Names

The names of each chapter follow the convention, io##-Topic.Rmd. This is so that they will alphabetically follow index.Rmd and largely be in order.
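You can check how a set of file names will sort with a quick sketch at the command line. The example below feeds a few of the chapter file names from _bookdown.yml, deliberately out of order, to the sort command. Setting LC_ALL=C is an assumption made here to get plain character-by-character ordering; your file browser may sort slightly differently.

```shell
sorted_chapters=$(printf '%s\n' \
  io10-tableOne.Rmd \
  io99-references.Rmd \
  index.Rmd \
  io02-getting-started.Rmd \
  io30-ttest.Rmd | LC_ALL=C sort)

echo "$sorted_chapters"
first_file=$(echo "$sorted_chapters" | head -n 1)
last_file=$(echo "$sorted_chapters" | tail -n 1)
echo "first: $first_file"
echo "last:  $last_file"
```

The file index.Rmd comes first because “in” sorts before “io”, and the two-digit numbers keep the chapters in sequence.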

13.7 When a Chapter is Ready for Sharing

Add the new chapter to the list of chapters, in order, in _bookdown.yml, somewhere in between
- index.Rmd and
- io98-title-holder.Rmd
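A misspelled file name in _bookdown.yml (a hyphen where the file has an underscore, for example) is an easy mistake to make. The sketch below is one possible way to check that every listed chapter file actually exists before you build. It is not part of {bookdown}; it builds a small made-up _bookdown.yml in a scratch folder and assumes that each chapter is on its own line starting with "- " and that file names contain no spaces.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"

# A cut-down, hypothetical _bookdown.yml for practice
cat > _bookdown.yml <<'EOF'
book_filename: "demo"
rmd_files:
- index.Rmd
- io02-getting-started.Rmd
- io98-title-holder.Rmd
EOF

touch index.Rmd io02-getting-started.Rmd   # the third file is missing on purpose

missing=""
# Pull out the lines that start with "- " and strip that prefix
for chapter in $(grep '^- ' _bookdown.yml | sed 's/^- //'); do
  if [ ! -f "$chapter" ]; then
    missing="$missing $chapter"
  fi
done
echo "missing files:$missing"
```

In your own book folder, you would skip the setup lines and run only the loop.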

13.8 Building the Book

Render the book with bookdown::render_book('index.Rmd')

13.9 Publishing the Book

Publish the book with
bookdown::publish_book(account = 'pdr_higgins')

Then commit the changes and push to GitHub.

Within a minute or three, the updated book will appear at:
https://bookdown.org/pdr_higgins/rmrwr/

More details can be found at:

https://bookdown.org/yihui/bookdown/rstudio-connect.html

and at

https://bookdown.org/home/about/
